* [PATCH 0/9] powerpc32: set of optimisation of network checksum functions
@ 2015-09-22 14:34 Christophe Leroy
2015-09-22 14:34 ` [PATCH 1/9] powerpc: unexport csum_tcpudp_magic Christophe Leroy
` (9 more replies)
0 siblings, 10 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, netdev
This patch series gathers patches related to the checksum functions on powerpc.
Some of these patches have already been submitted individually.
Christophe Leroy (9):
powerpc: unexport csum_tcpudp_magic
powerpc: mark xer clobbered in csum_add()
powerpc32: checksum_wrappers_64 becomes checksum_wrappers
powerpc: inline ip_fast_csum()
powerpc32: rewrite csum_partial_copy_generic() based on
copy_tofrom_user()
powerpc32: optimise a few instructions in csum_partial()
powerpc32: optimise csum_partial() loop
powerpc: simplify csum_add(a, b) in case a or b is constant 0
powerpc: optimise csum_partial() call when len is constant
arch/powerpc/include/asm/checksum.h | 143 +++++---
arch/powerpc/lib/Makefile | 3 +-
arch/powerpc/lib/checksum_32.S | 398 +++++++++++++--------
arch/powerpc/lib/checksum_64.S | 31 +-
...{checksum_wrappers_64.c => checksum_wrappers.c} | 0
arch/powerpc/lib/ppc_ksyms.c | 4 +-
6 files changed, 350 insertions(+), 229 deletions(-)
rename arch/powerpc/lib/{checksum_wrappers_64.c => checksum_wrappers.c} (100%)
--
2.1.0
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/9] powerpc: unexport csum_tcpudp_magic
2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
2015-09-22 14:34 ` [PATCH 2/9] powerpc: mark xer clobbered in csum_add() Christophe Leroy
` (8 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, netdev
csum_tcpudp_magic() is now an inline function, so there is
nothing left to export.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/lib/ppc_ksyms.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/powerpc/lib/ppc_ksyms.c b/arch/powerpc/lib/ppc_ksyms.c
index c7f8e95..f5e427e 100644
--- a/arch/powerpc/lib/ppc_ksyms.c
+++ b/arch/powerpc/lib/ppc_ksyms.c
@@ -20,7 +20,6 @@ EXPORT_SYMBOL(strncmp);
EXPORT_SYMBOL(csum_partial);
EXPORT_SYMBOL(csum_partial_copy_generic);
EXPORT_SYMBOL(ip_fast_csum);
-EXPORT_SYMBOL(csum_tcpudp_magic);
#endif
EXPORT_SYMBOL(__copy_tofrom_user);
--
2.1.0
* [PATCH 2/9] powerpc: mark xer clobbered in csum_add()
2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
2015-09-22 14:34 ` [PATCH 1/9] powerpc: unexport csum_tcpudp_magic Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
2015-09-22 14:34 ` [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers Christophe Leroy
` (7 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, netdev
addc sets the carry bit, so XER is clobbered in csum_add().
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/include/asm/checksum.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index e8d9ef4..d2ca07b 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -141,7 +141,7 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
#else
asm("addc %0,%0,%1;"
"addze %0,%0;"
- : "+r" (csum) : "r" (addend));
+ : "+r" (csum) : "r" (addend) : "xer");
return csum;
#endif
}
--
2.1.0
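For readers less familiar with the addc/addze pair: the asm implements a 32-bit one's-complement addition with end-around carry. A minimal C sketch of the same operation (hypothetical helper name, mirroring the 64-bit branch of csum_add()):

```c
#include <stdint.h>

/* One's-complement 32-bit add with end-around carry: addc performs the
 * add and records the carry in XER[CA], addze folds that carry back in. */
static uint32_t csum_add_c(uint32_t csum, uint32_t addend)
{
    uint64_t res = (uint64_t)csum + addend; /* addc: carry lands in bit 32 */
    return (uint32_t)(res + (res >> 32));   /* addze: fold carry back in */
}
```

Because addze reads the carry bit that addc wrote, any inline asm containing this pair must list "xer" as clobbered, which is exactly what the patch adds.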
* [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers
2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
2015-09-22 14:34 ` [PATCH 1/9] powerpc: unexport csum_tcpudp_magic Christophe Leroy
2015-09-22 14:34 ` [PATCH 2/9] powerpc: mark xer clobbered in csum_add() Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
2015-10-23 3:26 ` Scott Wood
2015-09-22 14:34 ` [PATCH 4/9] powerpc: inline ip_fast_csum() Christophe Leroy
` (6 subsequent siblings)
9 siblings, 1 reply; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, netdev
The powerpc64 checksum wrapper functions add csum_and_copy_to_user(),
which is otherwise implemented in include/net/checksum.h using
csum_partial() followed by copy_to_user().
Those wrapper functions are also applicable to powerpc32, as they are
based on csum_partial_copy_generic(), which also exists on powerpc32.
This patch renames arch/powerpc/lib/checksum_wrappers_64.c to
arch/powerpc/lib/checksum_wrappers.c and builds it regardless of
CONFIG_WORD_SIZE.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/include/asm/checksum.h | 9 ---------
arch/powerpc/lib/Makefile | 3 +--
arch/powerpc/lib/{checksum_wrappers_64.c => checksum_wrappers.c} | 0
3 files changed, 1 insertion(+), 11 deletions(-)
rename arch/powerpc/lib/{checksum_wrappers_64.c => checksum_wrappers.c} (100%)
diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index d2ca07b..afa6722 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -47,21 +47,12 @@ extern __wsum csum_partial_copy_generic(const void *src, void *dst,
int len, __wsum sum,
int *src_err, int *dst_err);
-#ifdef __powerpc64__
#define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER
extern __wsum csum_and_copy_from_user(const void __user *src, void *dst,
int len, __wsum sum, int *err_ptr);
#define HAVE_CSUM_COPY_USER
extern __wsum csum_and_copy_to_user(const void *src, void __user *dst,
int len, __wsum sum, int *err_ptr);
-#else
-/*
- * the same as csum_partial, but copies from src to dst while it
- * checksums.
- */
-#define csum_partial_copy_from_user(src, dst, len, sum, errp) \
- csum_partial_copy_generic((__force const void *)(src), (dst), (len), (sum), (errp), NULL)
-#endif
#define csum_partial_copy_nocheck(src, dst, len, sum) \
csum_partial_copy_generic((src), (dst), (len), (sum), NULL, NULL)
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index a47e142..e46b068 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -22,8 +22,7 @@ obj64-$(CONFIG_SMP) += locks.o
obj64-$(CONFIG_ALTIVEC) += vmx-helper.o
ifeq ($(CONFIG_GENERIC_CSUM),)
-obj-y += checksum_$(CONFIG_WORD_SIZE).o
-obj-$(CONFIG_PPC64) += checksum_wrappers_64.o
+obj-y += checksum_$(CONFIG_WORD_SIZE).o checksum_wrappers.o
endif
obj-$(CONFIG_PPC_EMULATE_SSTEP) += sstep.o ldstfp.o
diff --git a/arch/powerpc/lib/checksum_wrappers_64.c b/arch/powerpc/lib/checksum_wrappers.c
similarity index 100%
rename from arch/powerpc/lib/checksum_wrappers_64.c
rename to arch/powerpc/lib/checksum_wrappers.c
--
2.1.0
* [PATCH 4/9] powerpc: inline ip_fast_csum()
2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
` (2 preceding siblings ...)
2015-09-22 14:34 ` [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
2015-09-23 5:43 ` Denis Kirjanov
2016-03-05 3:50 ` [4/9] " Scott Wood
2015-09-22 14:34 ` [PATCH 5/9] powerpc32: rewrite csum_partial_copy_generic() based on copy_tofrom_user() Christophe Leroy
` (5 subsequent siblings)
9 siblings, 2 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, netdev
On several architectures, ip_fast_csum() is inlined.
There are functions, like ip_send_check(), which do little more
than call ip_fast_csum().
Inlining ip_fast_csum() lets the compiler optimise those callers better.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/include/asm/checksum.h | 46 +++++++++++++++++++++++++++++++------
arch/powerpc/lib/checksum_32.S | 21 -----------------
arch/powerpc/lib/checksum_64.S | 27 ----------------------
arch/powerpc/lib/ppc_ksyms.c | 1 -
4 files changed, 39 insertions(+), 56 deletions(-)
diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index afa6722..56deea8 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -9,16 +9,9 @@
* 2 of the License, or (at your option) any later version.
*/
-/*
- * This is a version of ip_compute_csum() optimized for IP headers,
- * which always checksum on 4 octet boundaries. ihl is the number
- * of 32-bit words and is always >= 5.
- */
#ifdef CONFIG_GENERIC_CSUM
#include <asm-generic/checksum.h>
#else
-extern __sum16 ip_fast_csum(const void *iph, unsigned int ihl);
-
/*
* computes the checksum of a memory block at buff, length len,
* and adds in "sum" (32-bit)
@@ -137,6 +130,45 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
#endif
}
+/*
+ * This is a version of ip_compute_csum() optimized for IP headers,
+ * which always checksum on 4 octet boundaries. ihl is the number
+ * of 32-bit words and is always >= 5.
+ */
+static inline __wsum ip_fast_csum_nofold(const void *iph, unsigned int ihl)
+{
+ u32 *ptr = (u32 *)iph + 1;
+#ifdef __powerpc64__
+ unsigned int i;
+ u64 s = *(__force u32 *)iph;
+
+ for (i = 0; i < ihl - 1; i++, ptr++)
+ s += *ptr;
+ s += (s >> 32);
+ return (__force __wsum)s;
+
+#else
+ __wsum sum, tmp;
+
+ asm("mtctr %3;"
+ "addc %0,%4,%5;"
+ "1:lwzu %1, 4(%2);"
+ "adde %0,%0,%1;"
+ "bdnz 1b;"
+ "addze %0,%0;"
+ : "=r"(sum), "=r"(tmp), "+b"(ptr)
+ : "r"(ihl - 2), "r"(*(u32 *)iph), "r"(*ptr)
+ : "ctr", "xer", "memory");
+
+ return sum;
+#endif
+}
+
+static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
+{
+ return csum_fold(ip_fast_csum_nofold(iph, ihl));
+}
+
#endif
#endif /* __KERNEL__ */
#endif
diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 6d67e05..0d7eba3 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -20,27 +20,6 @@
.text
/*
- * ip_fast_csum(buf, len) -- Optimized for IP header
- * len is in words and is always >= 5.
- */
-_GLOBAL(ip_fast_csum)
- lwz r0,0(r3)
- lwzu r5,4(r3)
- addic. r4,r4,-2
- addc r0,r0,r5
- mtctr r4
- blelr-
-1: lwzu r4,4(r3)
- adde r0,r0,r4
- bdnz 1b
- addze r0,r0 /* add in final carry */
- rlwinm r3,r0,16,0,31 /* fold two halves together */
- add r3,r0,r3
- not r3,r3
- srwi r3,r3,16
- blr
-
-/*
* computes the checksum of a memory block at buff, length len,
* and adds in "sum" (32-bit)
*
diff --git a/arch/powerpc/lib/checksum_64.S b/arch/powerpc/lib/checksum_64.S
index f3ef354..f53f4ab 100644
--- a/arch/powerpc/lib/checksum_64.S
+++ b/arch/powerpc/lib/checksum_64.S
@@ -18,33 +18,6 @@
#include <asm/ppc_asm.h>
/*
- * ip_fast_csum(r3=buf, r4=len) -- Optimized for IP header
- * len is in words and is always >= 5.
- *
- * In practice len == 5, but this is not guaranteed. So this code does not
- * attempt to use doubleword instructions.
- */
-_GLOBAL(ip_fast_csum)
- lwz r0,0(r3)
- lwzu r5,4(r3)
- addic. r4,r4,-2
- addc r0,r0,r5
- mtctr r4
- blelr-
-1: lwzu r4,4(r3)
- adde r0,r0,r4
- bdnz 1b
- addze r0,r0 /* add in final carry */
- rldicl r4,r0,32,0 /* fold two 32-bit halves together */
- add r0,r0,r4
- srdi r0,r0,32
- rlwinm r3,r0,16,0,31 /* fold two halves together */
- add r3,r0,r3
- not r3,r3
- srwi r3,r3,16
- blr
-
-/*
* Computes the checksum of a memory block at buff, length len,
* and adds in "sum" (32-bit).
*
diff --git a/arch/powerpc/lib/ppc_ksyms.c b/arch/powerpc/lib/ppc_ksyms.c
index f5e427e..8cd5c0b 100644
--- a/arch/powerpc/lib/ppc_ksyms.c
+++ b/arch/powerpc/lib/ppc_ksyms.c
@@ -19,7 +19,6 @@ EXPORT_SYMBOL(strncmp);
#ifndef CONFIG_GENERIC_CSUM
EXPORT_SYMBOL(csum_partial);
EXPORT_SYMBOL(csum_partial_copy_generic);
-EXPORT_SYMBOL(ip_fast_csum);
#endif
EXPORT_SYMBOL(__copy_tofrom_user);
--
2.1.0
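As a cross-check of what the new inline computes, here is a plain C model of the same algorithm: sum the header as whole 32-bit words, fold the 64-bit accumulator to 16 bits, and complement. The helper name is illustrative, and the sample header is the classic IPv4 example whose stored checksum makes verification fold to 0xffff:

```c
#include <stdint.h>
#include <string.h>

/* A well-known sample IPv4 header (20 bytes, ihl = 5) that already
 * contains its own checksum (0xb861), so verifying it must yield 0. */
static const unsigned char sample_iphdr[20] = {
    0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00,
    0x40, 0x11, 0xb8, 0x61, 0xc0, 0xa8, 0x00, 0x01,
    0xc0, 0xa8, 0x00, 0xc7
};

static uint16_t ip_fast_csum_c(const void *iph, unsigned int ihl)
{
    const unsigned char *p = iph;
    uint64_t s = 0;
    uint32_t r;
    unsigned int i;

    for (i = 0; i < ihl; i++) {
        uint32_t w;
        memcpy(&w, p + 4 * i, 4);   /* whole 32-bit words, as in the asm */
        s += w;
    }
    s += s >> 32;                   /* fold the 64-bit sum to 32 bits */
    r = (uint32_t)s;
    r = (r & 0xffff) + (r >> 16);   /* csum_fold(): fold to 16 bits */
    r = (r & 0xffff) + (r >> 16);
    return (uint16_t)~r;
}
```

Verifying a header that contains its checksum yields 0 on either endianness: the 16-bit one's-complement sum is merely byte-swapped on little-endian hosts, and 0xffff swaps to itself.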
* [PATCH 5/9] powerpc32: rewrite csum_partial_copy_generic() based on copy_tofrom_user()
2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
` (3 preceding siblings ...)
2015-09-22 14:34 ` [PATCH 4/9] powerpc: inline ip_fast_csum() Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
2015-09-22 14:34 ` [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial() Christophe Leroy
` (4 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, netdev
csum_partial_copy_generic() does the same as copy_tofrom_user() and
also calculates the checksum during the copy. Unlike
copy_tofrom_user(), the existing version of
csum_partial_copy_generic() does not take advantage of the cache.
This patch is a rewrite of csum_partial_copy_generic() based on
copy_tofrom_user().
The previous version of csum_partial_copy_generic() handled errors
itself. Now that we have the checksum wrapper functions handling the
error case, as on powerpc64, the error path can be kept simple:
just return -EFAULT.
copy_tofrom_user() leaves only r12 available, so we use it for the
checksum. r7 and r8, which contain the pointers used for error
feedback, are needed by the copy code, so we save them on the stack.
On a TCP benchmark using socklib on the loopback interface, with
checksum offload and scatter/gather deactivated, we see about a 20%
performance increase.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/lib/checksum_32.S | 320 +++++++++++++++++++++++++++--------------
1 file changed, 209 insertions(+), 111 deletions(-)
diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 0d7eba3..3472372 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -14,6 +14,7 @@
#include <linux/sys.h>
#include <asm/processor.h>
+#include <asm/cache.h>
#include <asm/errno.h>
#include <asm/ppc_asm.h>
@@ -66,123 +67,220 @@ _GLOBAL(csum_partial)
*
* csum_partial_copy_generic(src, dst, len, sum, src_err, dst_err)
*/
+#define CSUM_COPY_16_BYTES_WITHEX(n) \
+8 ## n ## 0: \
+ lwz r7,4(r4); \
+8 ## n ## 1: \
+ lwz r8,8(r4); \
+8 ## n ## 2: \
+ lwz r9,12(r4); \
+8 ## n ## 3: \
+ lwzu r10,16(r4); \
+8 ## n ## 4: \
+ stw r7,4(r6); \
+ adde r12,r12,r7; \
+8 ## n ## 5: \
+ stw r8,8(r6); \
+ adde r12,r12,r8; \
+8 ## n ## 6: \
+ stw r9,12(r6); \
+ adde r12,r12,r9; \
+8 ## n ## 7: \
+ stwu r10,16(r6); \
+ adde r12,r12,r10
+
+#define CSUM_COPY_16_BYTES_EXCODE(n) \
+.section __ex_table,"a"; \
+ .align 2; \
+ .long 8 ## n ## 0b,src_error; \
+ .long 8 ## n ## 1b,src_error; \
+ .long 8 ## n ## 2b,src_error; \
+ .long 8 ## n ## 3b,src_error; \
+ .long 8 ## n ## 4b,dst_error; \
+ .long 8 ## n ## 5b,dst_error; \
+ .long 8 ## n ## 6b,dst_error; \
+ .long 8 ## n ## 7b,dst_error; \
+ .text
+
+ .text
+ .stabs "arch/powerpc/lib/",N_SO,0,0,0f
+ .stabs "checksum_32.S",N_SO,0,0,0f
+0:
+
+CACHELINE_BYTES = L1_CACHE_BYTES
+LG_CACHELINE_BYTES = L1_CACHE_SHIFT
+CACHELINE_MASK = (L1_CACHE_BYTES-1)
+
_GLOBAL(csum_partial_copy_generic)
- addic r0,r6,0
- subi r3,r3,4
- subi r4,r4,4
- srwi. r6,r5,2
- beq 3f /* if we're doing < 4 bytes */
- andi. r9,r4,2 /* Align dst to longword boundary */
- beq+ 1f
-81: lhz r6,4(r3) /* do 2 bytes to get aligned */
- addi r3,r3,2
- subi r5,r5,2
-91: sth r6,4(r4)
- addi r4,r4,2
- addc r0,r0,r6
- srwi. r6,r5,2 /* # words to do */
- beq 3f
-1: srwi. r6,r5,4 /* # groups of 4 words to do */
- beq 10f
- mtctr r6
-71: lwz r6,4(r3)
-72: lwz r9,8(r3)
-73: lwz r10,12(r3)
-74: lwzu r11,16(r3)
- adde r0,r0,r6
-75: stw r6,4(r4)
- adde r0,r0,r9
-76: stw r9,8(r4)
- adde r0,r0,r10
-77: stw r10,12(r4)
- adde r0,r0,r11
-78: stwu r11,16(r4)
- bdnz 71b
-10: rlwinm. r6,r5,30,30,31 /* # words left to do */
- beq 13f
- mtctr r6
-82: lwzu r9,4(r3)
-92: stwu r9,4(r4)
- adde r0,r0,r9
- bdnz 82b
-13: andi. r5,r5,3
-3: cmpwi 0,r5,2
- blt+ 4f
-83: lhz r6,4(r3)
- addi r3,r3,2
- subi r5,r5,2
-93: sth r6,4(r4)
+ stwu r1,-16(r1)
+ stw r7,12(r1)
+ stw r8,8(r1)
+
+ andi. r0,r4,1 /* is destination address even ? */
+ cmplwi cr7,r0,0
+ addic r12,r6,0
+ addi r6,r4,-4
+ neg r0,r4
+ addi r4,r3,-4
+ andi. r0,r0,CACHELINE_MASK /* # bytes to start of cache line */
+ beq 58f
+
+ cmplw 0,r5,r0 /* is this more than total to do? */
+ blt 63f /* if not much to do */
+ andi. r8,r0,3 /* get it word-aligned first */
+ mtctr r8
+ beq+ 61f
+ li r3,0
+70: lbz r9,4(r4) /* do some bytes */
+ addi r4,r4,1
+ slwi r3,r3,8
+ rlwimi r3,r9,0,24,31
+71: stb r9,4(r6)
+ addi r6,r6,1
+ bdnz 70b
+ adde r12,r12,r3
+61: subf r5,r0,r5
+ srwi. r0,r0,2
+ mtctr r0
+ beq 58f
+72: lwzu r9,4(r4) /* do some words */
+ adde r12,r12,r9
+73: stwu r9,4(r6)
+ bdnz 72b
+
+58: srwi. r0,r5,LG_CACHELINE_BYTES /* # complete cachelines */
+ clrlwi r5,r5,32-LG_CACHELINE_BYTES
+ li r11,4
+ beq 63f
+
+ /* Here we decide how far ahead to prefetch the source */
+ li r3,4
+ cmpwi r0,1
+ li r7,0
+ ble 114f
+ li r7,1
+#if MAX_COPY_PREFETCH > 1
+ /* Heuristically, for large transfers we prefetch
+ MAX_COPY_PREFETCH cachelines ahead. For small transfers
+ we prefetch 1 cacheline ahead. */
+ cmpwi r0,MAX_COPY_PREFETCH
+ ble 112f
+ li r7,MAX_COPY_PREFETCH
+112: mtctr r7
+111: dcbt r3,r4
+ addi r3,r3,CACHELINE_BYTES
+ bdnz 111b
+#else
+ dcbt r3,r4
+ addi r3,r3,CACHELINE_BYTES
+#endif /* MAX_COPY_PREFETCH > 1 */
+
+114: subf r8,r7,r0
+ mr r0,r7
+ mtctr r8
+
+53: dcbt r3,r4
+54: dcbz r11,r6
+/* the main body of the cacheline loop */
+ CSUM_COPY_16_BYTES_WITHEX(0)
+#if L1_CACHE_BYTES >= 32
+ CSUM_COPY_16_BYTES_WITHEX(1)
+#if L1_CACHE_BYTES >= 64
+ CSUM_COPY_16_BYTES_WITHEX(2)
+ CSUM_COPY_16_BYTES_WITHEX(3)
+#if L1_CACHE_BYTES >= 128
+ CSUM_COPY_16_BYTES_WITHEX(4)
+ CSUM_COPY_16_BYTES_WITHEX(5)
+ CSUM_COPY_16_BYTES_WITHEX(6)
+ CSUM_COPY_16_BYTES_WITHEX(7)
+#endif
+#endif
+#endif
+ bdnz 53b
+ cmpwi r0,0
+ li r3,4
+ li r7,0
+ bne 114b
+
+63: srwi. r0,r5,2
+ mtctr r0
+ beq 64f
+30: lwzu r0,4(r4)
+ adde r12,r12,r0
+31: stwu r0,4(r6)
+ bdnz 30b
+
+64: andi. r0,r5,2
+ beq+ 65f
+40: lhz r0,4(r4)
addi r4,r4,2
- adde r0,r0,r6
-4: cmpwi 0,r5,1
- bne+ 5f
-84: lbz r6,4(r3)
-94: stb r6,4(r4)
- slwi r6,r6,8 /* Upper byte of word */
- adde r0,r0,r6
-5: addze r3,r0 /* add in final carry */
+41: sth r0,4(r6)
+ adde r12,r12,r0
+ addi r6,r6,2
+65: andi. r0,r5,1
+ beq+ 66f
+50: lbz r0,4(r4)
+51: stb r0,4(r6)
+ slwi r0,r0,8
+ adde r12,r12,r0
+66: addze r3,r12
+ addi r1,r1,16
+ beqlr+ cr7
+ rlwinm r3,r3,8,0,31 /* swap bytes for odd destination */
blr
-/* These shouldn't go in the fixup section, since that would
- cause the ex_table addresses to get out of order. */
-
-src_error_4:
- mfctr r6 /* update # bytes remaining from ctr */
- rlwimi r5,r6,4,0,27
- b 79f
-src_error_1:
- li r6,0
- subi r5,r5,2
-95: sth r6,4(r4)
- addi r4,r4,2
-79: srwi. r6,r5,2
- beq 3f
- mtctr r6
-src_error_2:
- li r6,0
-96: stwu r6,4(r4)
- bdnz 96b
-3: andi. r5,r5,3
- beq src_error
-src_error_3:
- li r6,0
- mtctr r5
- addi r4,r4,3
-97: stbu r6,1(r4)
- bdnz 97b
+/* read fault */
src_error:
- cmpwi 0,r7,0
- beq 1f
- li r6,-EFAULT
- stw r6,0(r7)
-1: addze r3,r0
+ lwz r7,12(r1)
+ addi r1,r1,16
+ cmpwi cr0,r7,0
+ beqlr
+ li r0,-EFAULT
+ stw r0,0(r7)
blr
-
+/* write fault */
dst_error:
- cmpwi 0,r8,0
- beq 1f
- li r6,-EFAULT
- stw r6,0(r8)
-1: addze r3,r0
+ lwz r8,8(r1)
+ addi r1,r1,16
+ cmpwi cr0,r8,0
+ beqlr
+ li r0,-EFAULT
+ stw r0,0(r8)
blr
-.section __ex_table,"a"
- .long 81b,src_error_1
- .long 91b,dst_error
- .long 71b,src_error_4
- .long 72b,src_error_4
- .long 73b,src_error_4
- .long 74b,src_error_4
- .long 75b,dst_error
- .long 76b,dst_error
- .long 77b,dst_error
- .long 78b,dst_error
- .long 82b,src_error_2
- .long 92b,dst_error
- .long 83b,src_error_3
- .long 93b,dst_error
- .long 84b,src_error_3
- .long 94b,dst_error
- .long 95b,dst_error
- .long 96b,dst_error
- .long 97b,dst_error
+ .section __ex_table,"a"
+ .align 2
+ .long 70b,src_error
+ .long 71b,dst_error
+ .long 72b,src_error
+ .long 73b,dst_error
+ .long 54b,dst_error
+ .text
+
+/*
+ * this stuff handles faults in the cacheline loop and branches to either
+ * src_error (if in read part) or dst_error (if in write part)
+ */
+ CSUM_COPY_16_BYTES_EXCODE(0)
+#if L1_CACHE_BYTES >= 32
+ CSUM_COPY_16_BYTES_EXCODE(1)
+#if L1_CACHE_BYTES >= 64
+ CSUM_COPY_16_BYTES_EXCODE(2)
+ CSUM_COPY_16_BYTES_EXCODE(3)
+#if L1_CACHE_BYTES >= 128
+ CSUM_COPY_16_BYTES_EXCODE(4)
+ CSUM_COPY_16_BYTES_EXCODE(5)
+ CSUM_COPY_16_BYTES_EXCODE(6)
+ CSUM_COPY_16_BYTES_EXCODE(7)
+#endif
+#endif
+#endif
+
+ .section __ex_table,"a"
+ .align 2
+ .long 30b,src_error
+ .long 31b,dst_error
+ .long 40b,src_error
+ .long 41b,dst_error
+ .long 50b,src_error
+ .long 51b,dst_error
--
2.1.0
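Stripped of the alignment, prefetch (dcbt) and cacheline pre-zeroing (dcbz) machinery, the core of the routine is a single pass that copies and accumulates at the same time. A minimal C model of that core (hypothetical names; whole words only, no fault handling):

```c
#include <stdint.h>
#include <string.h>

/* One's-complement add with end-around carry (models adde + final addze). */
static uint32_t csum_add_c(uint32_t a, uint32_t b)
{
    uint64_t r = (uint64_t)a + b;
    return (uint32_t)(r + (r >> 32));
}

/* Copy nwords 32-bit words from src to dst while accumulating the
 * one's-complement sum, like the lwz/stw/adde triplets in the asm. */
static uint32_t csum_and_copy_c(void *dst, const void *src,
                                unsigned int nwords, uint32_t sum)
{
    unsigned int i;

    for (i = 0; i < nwords; i++) {
        uint32_t w;

        memcpy(&w, (const unsigned char *)src + 4 * i, 4);
        memcpy((unsigned char *)dst + 4 * i, &w, 4);
        sum = csum_add_c(sum, w);
    }
    return sum;
}
```

The asm gains its speed precisely from what this sketch omits: touching each cacheline of the source ahead of time and zeroing destination lines so the store misses never read memory.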
* [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial()
2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
` (4 preceding siblings ...)
2015-09-22 14:34 ` [PATCH 5/9] powerpc32: rewrite csum_partial_copy_generic() based on copy_tofrom_user() Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
2015-10-23 3:30 ` Scott Wood
2015-09-22 14:34 ` [PATCH 7/9] powerpc32: optimise csum_partial() loop Christophe Leroy
` (3 subsequent siblings)
9 siblings, 1 reply; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, netdev
r5 already contains the value to be updated, so let's use r5
throughout. It makes the code more readable.
To avoid confusion, it is better to use adde instead of addc.
The first addition is useless: its only purpose is to clear the carry.
As r4 is a signed int that is always positive, this can be done by
using srawi instead of srwi.
Let's also remove the comment about bdnz having no overhead, as it
is not correct on all powerpc cores, at least not on the MPC8xx.
In the last part, the remaining number of bytes to be processed is
between 0 and 3, so we can base that part on bits 30 and 31 of r4
instead of ANDing r4 with 3 and then doing comparisons and
subtractions.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/lib/checksum_32.S | 37 +++++++++++++++++--------------------
1 file changed, 17 insertions(+), 20 deletions(-)
diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 3472372..9c12602 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -27,35 +27,32 @@
* csum_partial(buff, len, sum)
*/
_GLOBAL(csum_partial)
- addic r0,r5,0
subi r3,r3,4
- srwi. r6,r4,2
+ srawi. r6,r4,2 /* Divide len by 4 and also clear carry */
beq 3f /* if we're doing < 4 bytes */
- andi. r5,r3,2 /* Align buffer to longword boundary */
+ andi. r0,r3,2 /* Align buffer to longword boundary */
beq+ 1f
- lhz r5,4(r3) /* do 2 bytes to get aligned */
- addi r3,r3,2
+ lhz r0,4(r3) /* do 2 bytes to get aligned */
subi r4,r4,2
- addc r0,r0,r5
+ addi r3,r3,2
srwi. r6,r4,2 /* # words to do */
+ adde r5,r5,r0
beq 3f
1: mtctr r6
-2: lwzu r5,4(r3) /* the bdnz has zero overhead, so it should */
- adde r0,r0,r5 /* be unnecessary to unroll this loop */
+2: lwzu r0,4(r3)
+ adde r5,r5,r0
bdnz 2b
- andi. r4,r4,3
-3: cmpwi 0,r4,2
- blt+ 4f
- lhz r5,4(r3)
+3: andi. r0,r4,2
+ beq+ 4f
+ lhz r0,4(r3)
addi r3,r3,2
- subi r4,r4,2
- adde r0,r0,r5
-4: cmpwi 0,r4,1
- bne+ 5f
- lbz r5,4(r3)
- slwi r5,r5,8 /* Upper byte of word */
- adde r0,r0,r5
-5: addze r3,r0 /* add in final carry */
+ adde r5,r5,r0
+4: andi. r0,r4,1
+ beq+ 5f
+ lbz r0,4(r3)
+ slwi r0,r0,8 /* Upper byte of word */
+ adde r5,r5,r0
+5: addze r3,r5 /* add in final carry */
blr
/*
--
2.1.0
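The bit-testing idea for the 0-3 trailing bytes can be rendered in C: test bit 1 and bit 0 of the length rather than comparing and subtracting. A sketch with hypothetical helper names (the trailing byte goes into the upper half of a 16-bit quantity, matching the slwi r0,r0,8 in the asm; the halfword load is host-endian, where the asm assumes big-endian):

```c
#include <stdint.h>
#include <string.h>

/* One's-complement add with end-around carry. */
static uint32_t csum_add_c(uint32_t a, uint32_t b)
{
    uint64_t r = (uint64_t)a + b;
    return (uint32_t)(r + (r >> 32));
}

/* Fold in the final 0-3 bytes: len & 2 selects a trailing halfword,
 * len & 1 a trailing byte placed in the upper byte of a halfword. */
static uint32_t csum_tail(const unsigned char *p, int len, uint32_t sum)
{
    uint16_t h;

    if (len & 2) {              /* one halfword left (lhz path) */
        memcpy(&h, p, 2);
        sum = csum_add_c(sum, h);
        p += 2;
    }
    if (len & 1)                /* one byte left (lbz + slwi path) */
        sum = csum_add_c(sum, (uint32_t)*p << 8);
    return sum;
}
```

Since len is already known to be in 0-3 here, the two bit tests replace the cmpwi/subi sequences of the previous code.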
* [PATCH 7/9] powerpc32: optimise csum_partial() loop
2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
` (5 preceding siblings ...)
2015-09-22 14:34 ` [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial() Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
2015-09-22 14:34 ` [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0 Christophe Leroy
` (2 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, netdev
On the 8xx, load latency is 2 cycles and taking a branch also takes
2 cycles. So let's unroll the loop.
This patch improves csum_partial() speed by around 10% on both:
* 8xx (single-issue processor with parallel execution)
* 83xx (superscalar 6xx processor with dual instruction fetch
and parallel execution)
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/lib/checksum_32.S | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 9c12602..0d34f47 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -38,10 +38,24 @@ _GLOBAL(csum_partial)
srwi. r6,r4,2 /* # words to do */
adde r5,r5,r0
beq 3f
-1: mtctr r6
+1: andi. r6,r6,3 /* Prepare to handle words 4 by 4 */
+ beq 21f
+ mtctr r6
2: lwzu r0,4(r3)
adde r5,r5,r0
bdnz 2b
+21: srwi. r6,r4,4 /* # blocks of 4 words to do */
+ beq 3f
+ mtctr r6
+22: lwz r0,4(r3)
+ lwz r6,8(r3)
+ lwz r7,12(r3)
+ lwzu r8,16(r3)
+ adde r5,r5,r0
+ adde r5,r5,r6
+ adde r5,r5,r7
+ adde r5,r5,r8
+ bdnz 22b
3: andi. r0,r4,2
beq+ 4f
lhz r0,4(r3)
--
2.1.0
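The loop structure above (remainder words first, then four words per iteration) translates to C roughly as follows; csum_add_c models the adde chain and the names are illustrative:

```c
#include <stdint.h>

/* One's-complement add with end-around carry. */
static uint32_t csum_add_c(uint32_t a, uint32_t b)
{
    uint64_t r = (uint64_t)a + b;
    return (uint32_t)(r + (r >> 32));
}

/* Sum n 32-bit words: handle n % 4 words one at a time first, then
 * process the rest four per iteration so loads and adds overlap. */
static uint32_t csum_words_unrolled(const uint32_t *p, unsigned int n,
                                    uint32_t sum)
{
    unsigned int i;

    for (i = 0; i < (n & 3); i++)
        sum = csum_add_c(sum, *p++);
    for (; i < n; i += 4) {
        sum = csum_add_c(sum, p[0]);
        sum = csum_add_c(sum, p[1]);
        sum = csum_add_c(sum, p[2]);
        sum = csum_add_c(sum, p[3]);
        p += 4;
    }
    return sum;
}
```

Doing the remainder first means the unrolled body never needs a fix-up pass afterwards, mirroring the andi./srwi. split in the asm.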
* [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0
2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
` (6 preceding siblings ...)
2015-09-22 14:34 ` [PATCH 7/9] powerpc32: optimise csum_partial() loop Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
2015-10-23 3:33 ` Scott Wood
2015-09-22 14:34 ` [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant Christophe Leroy
2015-09-23 22:38 ` [PATCH 0/9] powerpc32: set of optimisation of network checksum functions David Miller
9 siblings, 1 reply; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, netdev
Simplify csum_add(a, b) in case a or b is constant 0
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/include/asm/checksum.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index 56deea8..f8a9704 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -119,7 +119,13 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
{
#ifdef __powerpc64__
u64 res = (__force u64)csum;
+#endif
+ if (__builtin_constant_p(csum) && csum == 0)
+ return addend;
+ if (__builtin_constant_p(addend) && addend == 0)
+ return csum;
+#ifdef __powerpc64__
res += (__force u64)addend;
return (__force __wsum)((u32)res + (res >> 32));
#else
--
2.1.0
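The pattern relies on __builtin_constant_p() (a GCC/Clang builtin) being true only when the argument is a compile-time constant, so both tests vanish entirely for runtime values. A standalone sketch of the same idiom, using the 64-bit fallback path and a hypothetical helper name:

```c
#include <stdint.h>

static inline uint32_t csum_add_opt(uint32_t csum, uint32_t addend)
{
    /* With a literal 0 argument the compiler folds these branches away
     * and emits no add at all; with runtime values they also disappear,
     * because __builtin_constant_p() is then false. */
    if (__builtin_constant_p(csum) && csum == 0)
        return addend;
    if (__builtin_constant_p(addend) && addend == 0)
        return csum;

    {
        uint64_t res = (uint64_t)csum + addend;
        return (uint32_t)(res + (res >> 32)); /* end-around carry */
    }
}
```

Either way the function returns the same value; the builtin only changes the generated code, which is why the simplification is safe.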
* [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant
2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
` (7 preceding siblings ...)
2015-09-22 14:34 ` [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0 Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
2015-10-23 3:32 ` Scott Wood
2016-03-05 5:29 ` [9/9] " Scott Wood
2015-09-23 22:38 ` [PATCH 0/9] powerpc32: set of optimisation of network checksum functions David Miller
9 siblings, 2 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, netdev
csum_partial() is often called for small fixed-length packets,
for which using the generic csum_partial() function is suboptimal.
For instance, in my configuration, I got:
* One place calling it with constant len 4
* Seven places calling it with constant len 8
* Three places calling it with constant len 14
* One place calling it with constant len 20
* One place calling it with constant len 24
* One place calling it with constant len 32
This patch renames csum_partial() to __csum_partial() and
implements csum_partial() as a wrapper inline function which
* uses csum_add() for small constant lengths that are a multiple of 16 bits
* uses ip_fast_csum() for other constant lengths that are a multiple of 32 bits
* uses __csum_partial() in all other cases
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/include/asm/checksum.h | 80 ++++++++++++++++++++++++++-----------
arch/powerpc/lib/checksum_32.S | 4 +-
arch/powerpc/lib/checksum_64.S | 4 +-
arch/powerpc/lib/ppc_ksyms.c | 2 +-
4 files changed, 62 insertions(+), 28 deletions(-)
diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index f8a9704..25c4657f 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -13,20 +13,6 @@
#include <asm-generic/checksum.h>
#else
/*
- * computes the checksum of a memory block at buff, length len,
- * and adds in "sum" (32-bit)
- *
- * returns a 32-bit number suitable for feeding into itself
- * or csum_tcpudp_magic
- *
- * this function must be called with even lengths, except
- * for the last fragment, which may be odd
- *
- * it's best to have buff aligned on a 32-bit boundary
- */
-extern __wsum csum_partial(const void *buff, int len, __wsum sum);
-
-/*
* Computes the checksum of a memory block at src, length len,
* and adds in "sum" (32-bit), while copying the block to dst.
* If an access exception occurs on src or dst, it stores -EFAULT
@@ -67,15 +53,6 @@ static inline __sum16 csum_fold(__wsum sum)
return (__force __sum16)(~((__force u32)sum + tmp) >> 16);
}
-/*
- * this routine is used for miscellaneous IP-like checksums, mainly
- * in icmp.c
- */
-static inline __sum16 ip_compute_csum(const void *buff, int len)
-{
- return csum_fold(csum_partial(buff, len, 0));
-}
-
static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
unsigned short len,
unsigned short proto,
@@ -175,6 +152,63 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
return csum_fold(ip_fast_csum_nofold(iph, ihl));
}
+/*
+ * computes the checksum of a memory block at buff, length len,
+ * and adds in "sum" (32-bit)
+ *
+ * returns a 32-bit number suitable for feeding into itself
+ * or csum_tcpudp_magic
+ *
+ * this function must be called with even lengths, except
+ * for the last fragment, which may be odd
+ *
+ * it's best to have buff aligned on a 32-bit boundary
+ */
+__wsum __csum_partial(const void *buff, int len, __wsum sum);
+
+static inline __wsum csum_partial(const void *buff, int len, __wsum sum)
+{
+ if (__builtin_constant_p(len) && len == 0)
+ return sum;
+
+ if (__builtin_constant_p(len) && len <= 16 && (len & 1) == 0) {
+ __wsum sum1;
+
+ if (len == 2)
+ sum1 = (__force u32)*(u16 *)buff;
+ if (len >= 4)
+ sum1 = *(u32 *)buff;
+ if (len == 6)
+ sum1 = csum_add(sum1, (__force u32)*(u16 *)(buff + 4));
+ if (len >= 8)
+ sum1 = csum_add(sum1, *(u32 *)(buff + 4));
+ if (len == 10)
+ sum1 = csum_add(sum1, (__force u32)*(u16 *)(buff + 8));
+ if (len >= 12)
+ sum1 = csum_add(sum1, *(u32 *)(buff + 8));
+ if (len == 14)
+ sum1 = csum_add(sum1, (__force u32)*(u16 *)(buff + 12));
+ if (len >= 16)
+ sum1 = csum_add(sum1, *(u32 *)(buff + 12));
+
+ sum = csum_add(sum1, sum);
+ } else if (__builtin_constant_p(len) && (len & 3) == 0) {
+ sum = csum_add(ip_fast_csum_nofold(buff, len >> 2), sum);
+ } else {
+ sum = __csum_partial(buff, len, sum);
+ }
+ return sum;
+}
+
+/*
+ * this routine is used for miscellaneous IP-like checksums, mainly
+ * in icmp.c
+ */
+static inline __sum16 ip_compute_csum(const void *buff, int len)
+{
+ return csum_fold(csum_partial(buff, len, 0));
+}
+
#endif
#endif /* __KERNEL__ */
#endif
diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 0d34f47..043d0088 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -24,9 +24,9 @@
* computes the checksum of a memory block at buff, length len,
* and adds in "sum" (32-bit)
*
- * csum_partial(buff, len, sum)
+ * __csum_partial(buff, len, sum)
*/
-_GLOBAL(csum_partial)
+_GLOBAL(__csum_partial)
subi r3,r3,4
srawi. r6,r4,2 /* Divide len by 4 and also clear carry */
beq 3f /* if we're doing < 4 bytes */
diff --git a/arch/powerpc/lib/checksum_64.S b/arch/powerpc/lib/checksum_64.S
index f53f4ab..4ab562d 100644
--- a/arch/powerpc/lib/checksum_64.S
+++ b/arch/powerpc/lib/checksum_64.S
@@ -21,9 +21,9 @@
* Computes the checksum of a memory block at buff, length len,
* and adds in "sum" (32-bit).
*
- * csum_partial(r3=buff, r4=len, r5=sum)
+ * __csum_partial(r3=buff, r4=len, r5=sum)
*/
-_GLOBAL(csum_partial)
+_GLOBAL(__csum_partial)
addic r0,r5,0 /* clear carry */
srdi. r6,r4,3 /* less than 8 bytes? */
diff --git a/arch/powerpc/lib/ppc_ksyms.c b/arch/powerpc/lib/ppc_ksyms.c
index 8cd5c0b..c422812 100644
--- a/arch/powerpc/lib/ppc_ksyms.c
+++ b/arch/powerpc/lib/ppc_ksyms.c
@@ -17,7 +17,7 @@ EXPORT_SYMBOL(strcmp);
EXPORT_SYMBOL(strncmp);
#ifndef CONFIG_GENERIC_CSUM
-EXPORT_SYMBOL(csum_partial);
+EXPORT_SYMBOL(__csum_partial);
EXPORT_SYMBOL(csum_partial_copy_generic);
#endif
--
2.1.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
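The inline csum_partial() wrapper above leans on csum_add()'s 32-bit one's-complement addition plus a final 16-bit fold. A minimal userspace sketch (hypothetical csum_add32()/csum_fold32() helpers, not the kernel API) shows why summing 32-bit words and folding gives the same 16-bit result as a naive 16-bit one's-complement sum:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit one's-complement add with end-around carry, modelling csum_add(). */
static uint32_t csum_add32(uint32_t csum, uint32_t addend)
{
	uint64_t res = (uint64_t)csum + addend;
	return (uint32_t)res + (uint32_t)(res >> 32);
}

/* Fold a 32-bit partial sum to 16 bits and complement, modelling csum_fold(). */
static uint16_t csum_fold32(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);	/* absorb the carry of the first fold */
	return (uint16_t)~sum;
}

/* Naive 16-bit one's-complement checksum (little-endian halfwords) for cross-checking. */
static uint16_t csum_naive(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)(buf[i] | buf[i + 1] << 8);
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Summing the same bytes as little-endian 32-bit words through csum_add32() and folding matches csum_naive(); this word-size independence (RFC 1071) is what lets the constant-length wrapper mix 16-bit and 32-bit loads freely.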
* Re: [PATCH 4/9] powerpc: inline ip_fast_csum()
2015-09-22 14:34 ` [PATCH 4/9] powerpc: inline ip_fast_csum() Christophe Leroy
@ 2015-09-23 5:43 ` Denis Kirjanov
2016-02-29 7:25 ` Christophe Leroy
2016-03-05 3:50 ` [4/9] " Scott Wood
1 sibling, 1 reply; 22+ messages in thread
From: Denis Kirjanov @ 2015-09-23 5:43 UTC (permalink / raw)
To: Christophe Leroy
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
scottwood, linux-kernel, linuxppc-dev, netdev
On 9/22/15, Christophe Leroy <christophe.leroy@c-s.fr> wrote:
> In several architectures, ip_fast_csum() is inlined.
> There are functions like ip_send_check() which do little
> more than call ip_fast_csum().
> Inlining ip_fast_csum() allows the compiler to optimise better
Hi Christophe,
I tried it and saw no difference on ppc64. Did you test with socklib
with a modified loopback, and if so, do you have any numbers?
>
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
> arch/powerpc/include/asm/checksum.h | 46
> +++++++++++++++++++++++++++++++------
> arch/powerpc/lib/checksum_32.S | 21 -----------------
> arch/powerpc/lib/checksum_64.S | 27 ----------------------
> arch/powerpc/lib/ppc_ksyms.c | 1 -
> 4 files changed, 39 insertions(+), 56 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/checksum.h
> b/arch/powerpc/include/asm/checksum.h
> index afa6722..56deea8 100644
> --- a/arch/powerpc/include/asm/checksum.h
> +++ b/arch/powerpc/include/asm/checksum.h
> @@ -9,16 +9,9 @@
> * 2 of the License, or (at your option) any later version.
> */
>
> -/*
> - * This is a version of ip_compute_csum() optimized for IP headers,
> - * which always checksum on 4 octet boundaries. ihl is the number
> - * of 32-bit words and is always >= 5.
> - */
> #ifdef CONFIG_GENERIC_CSUM
> #include <asm-generic/checksum.h>
> #else
> -extern __sum16 ip_fast_csum(const void *iph, unsigned int ihl);
> -
> /*
> * computes the checksum of a memory block at buff, length len,
> * and adds in "sum" (32-bit)
> @@ -137,6 +130,45 @@ static inline __wsum csum_add(__wsum csum, __wsum
> addend)
> #endif
> }
>
> +/*
> + * This is a version of ip_compute_csum() optimized for IP headers,
> + * which always checksum on 4 octet boundaries. ihl is the number
> + * of 32-bit words and is always >= 5.
> + */
> +static inline __wsum ip_fast_csum_nofold(const void *iph, unsigned int
> ihl)
> +{
> + u32 *ptr = (u32 *)iph + 1;
> +#ifdef __powerpc64__
> + unsigned int i;
> + u64 s = *(__force u32 *)iph;
> +
> + for (i = 0; i < ihl - 1; i++, ptr++)
> + s += *ptr;
> + s += (s >> 32);
> + return (__force __wsum)s;
> +
> +#else
> + __wsum sum, tmp;
> +
> + asm("mtctr %3;"
> + "addc %0,%4,%5;"
> + "1:lwzu %1, 4(%2);"
> + "adde %0,%0,%1;"
> + "bdnz 1b;"
> + "addze %0,%0;"
> + : "=r"(sum), "=r"(tmp), "+b"(ptr)
> + : "r"(ihl - 2), "r"(*(u32 *)iph), "r"(*ptr)
> + : "ctr", "xer", "memory");
> +
> + return sum;
> +#endif
> +}
> +
> +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
> +{
> + return csum_fold(ip_fast_csum_nofold(iph, ihl));
> +}
> +
> #endif
> #endif /* __KERNEL__ */
> #endif
> diff --git a/arch/powerpc/lib/checksum_32.S
> b/arch/powerpc/lib/checksum_32.S
> index 6d67e05..0d7eba3 100644
> --- a/arch/powerpc/lib/checksum_32.S
> +++ b/arch/powerpc/lib/checksum_32.S
> @@ -20,27 +20,6 @@
> .text
>
> /*
> - * ip_fast_csum(buf, len) -- Optimized for IP header
> - * len is in words and is always >= 5.
> - */
> -_GLOBAL(ip_fast_csum)
> - lwz r0,0(r3)
> - lwzu r5,4(r3)
> - addic. r4,r4,-2
> - addc r0,r0,r5
> - mtctr r4
> - blelr-
> -1: lwzu r4,4(r3)
> - adde r0,r0,r4
> - bdnz 1b
> - addze r0,r0 /* add in final carry */
> - rlwinm r3,r0,16,0,31 /* fold two halves together */
> - add r3,r0,r3
> - not r3,r3
> - srwi r3,r3,16
> - blr
> -
> -/*
> * computes the checksum of a memory block at buff, length len,
> * and adds in "sum" (32-bit)
> *
> diff --git a/arch/powerpc/lib/checksum_64.S
> b/arch/powerpc/lib/checksum_64.S
> index f3ef354..f53f4ab 100644
> --- a/arch/powerpc/lib/checksum_64.S
> +++ b/arch/powerpc/lib/checksum_64.S
> @@ -18,33 +18,6 @@
> #include <asm/ppc_asm.h>
>
> /*
> - * ip_fast_csum(r3=buf, r4=len) -- Optimized for IP header
> - * len is in words and is always >= 5.
> - *
> - * In practice len == 5, but this is not guaranteed. So this code does
> not
> - * attempt to use doubleword instructions.
> - */
> -_GLOBAL(ip_fast_csum)
> - lwz r0,0(r3)
> - lwzu r5,4(r3)
> - addic. r4,r4,-2
> - addc r0,r0,r5
> - mtctr r4
> - blelr-
> -1: lwzu r4,4(r3)
> - adde r0,r0,r4
> - bdnz 1b
> - addze r0,r0 /* add in final carry */
> - rldicl r4,r0,32,0 /* fold two 32-bit halves together */
> - add r0,r0,r4
> - srdi r0,r0,32
> - rlwinm r3,r0,16,0,31 /* fold two halves together */
> - add r3,r0,r3
> - not r3,r3
> - srwi r3,r3,16
> - blr
> -
> -/*
> * Computes the checksum of a memory block at buff, length len,
> * and adds in "sum" (32-bit).
> *
> diff --git a/arch/powerpc/lib/ppc_ksyms.c b/arch/powerpc/lib/ppc_ksyms.c
> index f5e427e..8cd5c0b 100644
> --- a/arch/powerpc/lib/ppc_ksyms.c
> +++ b/arch/powerpc/lib/ppc_ksyms.c
> @@ -19,7 +19,6 @@ EXPORT_SYMBOL(strncmp);
> #ifndef CONFIG_GENERIC_CSUM
> EXPORT_SYMBOL(csum_partial);
> EXPORT_SYMBOL(csum_partial_copy_generic);
> -EXPORT_SYMBOL(ip_fast_csum);
> #endif
>
> EXPORT_SYMBOL(__copy_tofrom_user);
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
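The function under discussion computes the standard IPv4 header checksum. A portable C model (illustrative ip_csum_model(), not the kernel's assembly or inline routine) can be checked against a well-known example header:

```c
#include <assert.h>
#include <stdint.h>

/* Model of ip_fast_csum(): one's-complement sum over ihl 32-bit words of
 * the IP header, folded to 16 bits and complemented. Summing 16-bit
 * network-order halfwords yields the same folded result. */
static uint16_t ip_csum_model(const uint8_t *iph, unsigned int ihl)
{
	uint32_t sum = 0;

	for (unsigned int i = 0; i < ihl * 4; i += 2)
		sum += (uint32_t)iph[i] << 8 | iph[i + 1];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

With the header's checksum field zeroed, the model returns the value to store (0xb861 for the classic 192.168.0.1 to 192.168.0.199 UDP example header); with the field filled in, verification yields 0.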
* Re: [PATCH 0/9] powerpc32: set of optimisation of network checksum functions
2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
` (8 preceding siblings ...)
2015-09-22 14:34 ` [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant Christophe Leroy
@ 2015-09-23 22:38 ` David Miller
9 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2015-09-23 22:38 UTC (permalink / raw)
To: christophe.leroy
Cc: benh, paulus, mpe, scottwood, linux-kernel, linuxppc-dev, netdev
From: Christophe Leroy <christophe.leroy@c-s.fr>
Date: Tue, 22 Sep 2015 16:34:17 +0200 (CEST)
> This patch series gathers patches related to checksum functions on powerpc.
> Some of those patches have already been submitted individually.
I'm assuming that the powerpc folks will integrate this series.
Let me know if I should take it into net-next instead.
Thanks.
* Re: [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers
2015-09-22 14:34 ` [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers Christophe Leroy
@ 2015-10-23 3:26 ` Scott Wood
2015-10-28 11:11 ` Anton Blanchard
0 siblings, 1 reply; 22+ messages in thread
From: Scott Wood @ 2015-10-23 3:26 UTC (permalink / raw)
To: Christophe Leroy
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
linux-kernel, linuxppc-dev, netdev, anton
On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
> The powerpc64 checksum wrapper code adds csum_and_copy_to_user()
> which otherwise is implemented in include/net/checksum.h by using
> csum_partial() then copy_to_user()
>
> Those two wrapper functions are also applicable to powerpc32, as it is
> based on the use of csum_partial_copy_generic() which also
> exists on powerpc32
>
> This patch renames arch/powerpc/lib/checksum_wrappers_64.c to
> arch/powerpc/lib/checksum_wrappers.c and
> makes it independent of CONFIG_WORD_SIZE
>
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
> arch/powerpc/include/asm/checksum.h | 9 ------
> ---
> arch/powerpc/lib/Makefile | 3 +--
> arch/powerpc/lib/{checksum_wrappers_64.c => checksum_wrappers.c} | 0
> 3 files changed, 1 insertion(+), 11 deletions(-)
> rename arch/powerpc/lib/{checksum_wrappers_64.c => checksum_wrappers.c}
> (100%)
I wonder why it was 64-bit specific in the first place.
CCing Anton Blanchard.
-Scott
>
> diff --git a/arch/powerpc/include/asm/checksum.h
> b/arch/powerpc/include/asm/checksum.h
> index d2ca07b..afa6722 100644
> --- a/arch/powerpc/include/asm/checksum.h
> +++ b/arch/powerpc/include/asm/checksum.h
> @@ -47,21 +47,12 @@ extern __wsum csum_partial_copy_generic(const void
> *src, void *dst,
> int len, __wsum sum,
> int *src_err, int *dst_err);
>
> -#ifdef __powerpc64__
> #define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER
> extern __wsum csum_and_copy_from_user(const void __user *src, void *dst,
> int len, __wsum sum, int *err_ptr);
> #define HAVE_CSUM_COPY_USER
> extern __wsum csum_and_copy_to_user(const void *src, void __user *dst,
> int len, __wsum sum, int *err_ptr);
> -#else
> -/*
> - * the same as csum_partial, but copies from src to dst while it
> - * checksums.
> - */
> -#define csum_partial_copy_from_user(src, dst, len, sum, errp) \
> - csum_partial_copy_generic((__force const void *)(src), (dst),
> (len), (sum), (errp), NULL)
> -#endif
>
> #define csum_partial_copy_nocheck(src, dst, len, sum) \
> csum_partial_copy_generic((src), (dst), (len), (sum), NULL, NULL)
> diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
> index a47e142..e46b068 100644
> --- a/arch/powerpc/lib/Makefile
> +++ b/arch/powerpc/lib/Makefile
> @@ -22,8 +22,7 @@ obj64-$(CONFIG_SMP) += locks.o
> obj64-$(CONFIG_ALTIVEC) += vmx-helper.o
>
> ifeq ($(CONFIG_GENERIC_CSUM),)
> -obj-y += checksum_$(CONFIG_WORD_SIZE).o
> -obj-$(CONFIG_PPC64) += checksum_wrappers_64.o
> +obj-y += checksum_$(CONFIG_WORD_SIZE).o checksum_wrappers.o
> endif
>
> obj-$(CONFIG_PPC_EMULATE_SSTEP) += sstep.o ldstfp.o
> diff --git a/arch/powerpc/lib/checksum_wrappers_64.c
> b/arch/powerpc/lib/checksum_wrappers.c
> similarity index 100%
> rename from arch/powerpc/lib/checksum_wrappers_64.c
> rename to arch/powerpc/lib/checksum_wrappers.c
* Re: [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial()
2015-09-22 14:34 ` [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial() Christophe Leroy
@ 2015-10-23 3:30 ` Scott Wood
2016-02-29 12:53 ` Christophe Leroy
0 siblings, 1 reply; 22+ messages in thread
From: Scott Wood @ 2015-10-23 3:30 UTC (permalink / raw)
To: Christophe Leroy
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
linux-kernel, linuxppc-dev, netdev
On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
> r5 already contains the value to be updated, so let's use r5 throughout.
> It makes the code more readable.
>
> To avoid confusion, it is better to use adde instead of addc
>
> The first addition is useless. Its only purpose is to clear carry.
> As r4 is a signed int that is always positive, this can be done by
> using srawi instead of srwi
>
> Let's also remove the comment about bdnz having no overhead as it
> is not correct on all powerpc, at least on MPC8xx
>
> In the last part, the remaining number of bytes
> to be processed is between 0 and 3. Therefore, we can base that part
> on bits 31 and 30 of r4 instead of ANDing r4 with 3 and
> then proceeding with comparisons and subtractions.
>
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
> arch/powerpc/lib/checksum_32.S | 37 +++++++++++++++++--------------------
> 1 file changed, 17 insertions(+), 20 deletions(-)
Do you have benchmarks for these optimizations?
-Scott
>
> diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
> index 3472372..9c12602 100644
> --- a/arch/powerpc/lib/checksum_32.S
> +++ b/arch/powerpc/lib/checksum_32.S
> @@ -27,35 +27,32 @@
> * csum_partial(buff, len, sum)
> */
> _GLOBAL(csum_partial)
> - addic r0,r5,0
> subi r3,r3,4
> - srwi. r6,r4,2
> + srawi. r6,r4,2 /* Divide len by 4 and also clear carry */
> beq 3f /* if we're doing < 4 bytes */
> - andi. r5,r3,2 /* Align buffer to longword boundary */
> + andi. r0,r3,2 /* Align buffer to longword boundary */
> beq+ 1f
> - lhz r5,4(r3) /* do 2 bytes to get aligned */
> - addi r3,r3,2
> + lhz r0,4(r3) /* do 2 bytes to get aligned */
> subi r4,r4,2
> - addc r0,r0,r5
> + addi r3,r3,2
> srwi. r6,r4,2 /* # words to do */
> + adde r5,r5,r0
> beq 3f
> 1: mtctr r6
> -2: lwzu r5,4(r3) /* the bdnz has zero overhead, so it should */
> - adde r0,r0,r5 /* be unnecessary to unroll this loop */
> +2: lwzu r0,4(r3)
> + adde r5,r5,r0
> bdnz 2b
> - andi. r4,r4,3
> -3: cmpwi 0,r4,2
> - blt+ 4f
> - lhz r5,4(r3)
> +3: andi. r0,r4,2
> + beq+ 4f
> + lhz r0,4(r3)
> addi r3,r3,2
> - subi r4,r4,2
> - adde r0,r0,r5
> -4: cmpwi 0,r4,1
> - bne+ 5f
> - lbz r5,4(r3)
> - slwi r5,r5,8 /* Upper byte of word */
> - adde r0,r0,r5
> -5: addze r3,r0 /* add in final carry */
> + adde r5,r5,r0
> +4: andi. r0,r4,1
> + beq+ 5f
> + lbz r0,4(r3)
> + slwi r0,r0,8 /* Upper byte of word */
> + adde r5,r5,r0
> +5: addze r3,r5 /* add in final carry */
> blr
>
> /*
* Re: [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant
2015-09-22 14:34 ` [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant Christophe Leroy
@ 2015-10-23 3:32 ` Scott Wood
2016-03-05 5:29 ` [9/9] " Scott Wood
1 sibling, 0 replies; 22+ messages in thread
From: Scott Wood @ 2015-10-23 3:32 UTC (permalink / raw)
To: Christophe Leroy
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
linux-kernel, linuxppc-dev, netdev
On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
> csum_partial is often called for small fixed length packets
> for which it is suboptimal to use the generic csum_partial()
> function.
>
> For instance, in my configuration, I got:
> * One place calling it with constant len 4
> * Seven places calling it with constant len 8
> * Three places calling it with constant len 14
> * One place calling it with constant len 20
> * One place calling it with constant len 24
> * One place calling it with constant len 32
>
> This patch renames csum_partial() to __csum_partial() and
> implements csum_partial() as a wrapper inline function which
> * uses csum_add() for small 16bits multiple constant length
> * uses ip_fast_csum() for other 32bits multiple constant
> * uses __csum_partial() in all other cases
>
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
> arch/powerpc/include/asm/checksum.h | 80 ++++++++++++++++++++++++++--------
> ---
> arch/powerpc/lib/checksum_32.S | 4 +-
> arch/powerpc/lib/checksum_64.S | 4 +-
> arch/powerpc/lib/ppc_ksyms.c | 2 +-
> 4 files changed, 62 insertions(+), 28 deletions(-)
Benchmarks?
-Scott
* Re: [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0
2015-09-22 14:34 ` [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0 Christophe Leroy
@ 2015-10-23 3:33 ` Scott Wood
2016-02-29 7:26 ` Christophe Leroy
0 siblings, 1 reply; 22+ messages in thread
From: Scott Wood @ 2015-10-23 3:33 UTC (permalink / raw)
To: Christophe Leroy
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
linux-kernel, linuxppc-dev, netdev
On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
> Simplify csum_add(a, b) in case a or b is constant 0
>
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
> arch/powerpc/include/asm/checksum.h | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/checksum.h
> b/arch/powerpc/include/asm/checksum.h
> index 56deea8..f8a9704 100644
> --- a/arch/powerpc/include/asm/checksum.h
> +++ b/arch/powerpc/include/asm/checksum.h
> @@ -119,7 +119,13 @@ static inline __wsum csum_add(__wsum csum, __wsum
> addend)
> {
> #ifdef __powerpc64__
> u64 res = (__force u64)csum;
> +#endif
> + if (__builtin_constant_p(csum) && csum == 0)
> + return addend;
> + if (__builtin_constant_p(addend) && addend == 0)
> + return csum;
>
> +#ifdef __powerpc64__
> res += (__force u64)addend;
> return (__force __wsum)((u32)res + (res >> 32));
> #else
How often does this happen?
-Scott
* Re: [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers
2015-10-23 3:26 ` Scott Wood
@ 2015-10-28 11:11 ` Anton Blanchard
0 siblings, 0 replies; 22+ messages in thread
From: Anton Blanchard @ 2015-10-28 11:11 UTC (permalink / raw)
To: Scott Wood
Cc: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
Michael Ellerman, linux-kernel, linuxppc-dev, netdev
Hi Scott,
> I wonder why it was 64-bit specific in the first place.
I think it was part of a series where I added my 64-bit assembly checksum
routines, and I didn't step back and think that the wrapper code would
be useful on 32-bit.
Anton
* Re: [PATCH 4/9] powerpc: inline ip_fast_csum()
2015-09-23 5:43 ` Denis Kirjanov
@ 2016-02-29 7:25 ` Christophe Leroy
0 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2016-02-29 7:25 UTC (permalink / raw)
To: Denis Kirjanov
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
scottwood, linux-kernel, linuxppc-dev, netdev
On 23/09/2015 07:43, Denis Kirjanov wrote:
> On 9/22/15, Christophe Leroy <christophe.leroy@c-s.fr> wrote:
>> In several architectures, ip_fast_csum() is inlined.
>> There are functions like ip_send_check() which do little
>> more than call ip_fast_csum().
>> Inlining ip_fast_csum() allows the compiler to optimise better
> Hi Christophe,
> I tried it and saw no difference on ppc64. Did you test with socklib
> with a modified loopback, and if so, do you have any numbers?
Hi Denis,
I put an mftbl at the start and end of ip_send_check() and tested on an MPC885:
* Without ip_fast_csum() inlined, approximately 7 timebase ticks are spent in
ip_send_check()
* With ip_fast_csum() inlined, approximately 5.4 timebase ticks are spent in
ip_send_check()
So it is about a 23% time reduction.
Christophe
* Re: [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0
2015-10-23 3:33 ` Scott Wood
@ 2016-02-29 7:26 ` Christophe Leroy
0 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2016-02-29 7:26 UTC (permalink / raw)
To: Scott Wood
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
linux-kernel, linuxppc-dev, netdev
On 23/10/2015 05:33, Scott Wood wrote:
> On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
>> Simplify csum_add(a, b) in case a or b is constant 0
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>> arch/powerpc/include/asm/checksum.h | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/checksum.h
>> b/arch/powerpc/include/asm/checksum.h
>> index 56deea8..f8a9704 100644
>> --- a/arch/powerpc/include/asm/checksum.h
>> +++ b/arch/powerpc/include/asm/checksum.h
>> @@ -119,7 +119,13 @@ static inline __wsum csum_add(__wsum csum, __wsum
>> addend)
>> {
>> #ifdef __powerpc64__
>> u64 res = (__force u64)csum;
>> +#endif
>> + if (__builtin_constant_p(csum) && csum == 0)
>> + return addend;
>> + if (__builtin_constant_p(addend) && addend == 0)
>> + return csum;
>>
>> +#ifdef __powerpc64__
>> res += (__force u64)addend;
>> return (__force __wsum)((u32)res + (res >> 32));
>> #else
> How often does this happen?
>
>
In the following patch (9/9), csum_add() is used to implement
csum_partial() for small blocks.
In several places in the networking code, csum_partial() is called with
0 as initial sum.
Christophe
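The constant-zero simplification above works because GCC evaluates __builtin_constant_p() after inlining, so the zero-addend branch disappears at compile time. A hedged sketch (illustrative add_or_skip() helper, not the kernel csum_add()):

```c
#include <assert.h>

/* Dispatch trick used by csum_add()/csum_partial(): when the compiler can
 * prove the addend is the constant 0, the whole call reduces to "csum". */
static inline unsigned int add_or_skip(unsigned int csum, unsigned int addend)
{
	if (__builtin_constant_p(addend) && addend == 0)
		return csum;		/* folded away for csum_partial(buf, len, 0) */
	return csum + addend;		/* stand-in for the carry-propagating add */
}
```

__builtin_constant_p() is a GCC/Clang extension; it returns 1 for literals and 0 for values the optimiser cannot prove constant, so the extra branch costs nothing in the generated code.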
* Re: [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial()
2015-10-23 3:30 ` Scott Wood
@ 2016-02-29 12:53 ` Christophe Leroy
0 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2016-02-29 12:53 UTC (permalink / raw)
To: Scott Wood
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
linux-kernel, linuxppc-dev, netdev
On 23/10/2015 05:30, Scott Wood wrote:
> On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
>> r5 already contains the value to be updated, so let's use r5 throughout.
>> It makes the code more readable.
>>
>> To avoid confusion, it is better to use adde instead of addc
>>
>> The first addition is useless. Its only purpose is to clear carry.
>> As r4 is a signed int that is always positive, this can be done by
>> using srawi instead of srwi
>>
>> Let's also remove the comment about bdnz having no overhead as it
>> is not correct on all powerpc, at least on MPC8xx
>>
>> In the last part, the remaining number of bytes
>> to be processed is between 0 and 3. Therefore, we can base that part
>> on bits 31 and 30 of r4 instead of ANDing r4 with 3 and
>> then proceeding with comparisons and subtractions.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>> arch/powerpc/lib/checksum_32.S | 37 +++++++++++++++++--------------------
>> 1 file changed, 17 insertions(+), 20 deletions(-)
> Do you have benchmarks for these optimizations?
>
> -Scott
Using mftbl() to read the timebase just before and after the call to
csum_partial(), I get the following on an MPC885:
* 78-byte packets: 9% faster (11.5 down to 10.4 timebase ticks)
* 328-byte packets: 3% faster (47.9 down to 46.5 timebase ticks)
Christophe
>
>> diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
>> index 3472372..9c12602 100644
>> --- a/arch/powerpc/lib/checksum_32.S
>> +++ b/arch/powerpc/lib/checksum_32.S
>> @@ -27,35 +27,32 @@
>> * csum_partial(buff, len, sum)
>> */
>> _GLOBAL(csum_partial)
>> - addic r0,r5,0
>> subi r3,r3,4
>> - srwi. r6,r4,2
>> + srawi. r6,r4,2 /* Divide len by 4 and also clear carry */
>> beq 3f /* if we're doing < 4 bytes */
>> - andi. r5,r3,2 /* Align buffer to longword boundary */
>> + andi. r0,r3,2 /* Align buffer to longword boundary */
>> beq+ 1f
>> - lhz r5,4(r3) /* do 2 bytes to get aligned */
>> - addi r3,r3,2
>> + lhz r0,4(r3) /* do 2 bytes to get aligned */
>> subi r4,r4,2
>> - addc r0,r0,r5
>> + addi r3,r3,2
>> srwi. r6,r4,2 /* # words to do */
>> + adde r5,r5,r0
>> beq 3f
>> 1: mtctr r6
>> -2: lwzu r5,4(r3) /* the bdnz has zero overhead, so it should */
>> - adde r0,r0,r5 /* be unnecessary to unroll this loop */
>> +2: lwzu r0,4(r3)
>> + adde r5,r5,r0
>> bdnz 2b
>> - andi. r4,r4,3
>> -3: cmpwi 0,r4,2
>> - blt+ 4f
>> - lhz r5,4(r3)
>> +3: andi. r0,r4,2
>> + beq+ 4f
>> + lhz r0,4(r3)
>> addi r3,r3,2
>> - subi r4,r4,2
>> - adde r0,r0,r5
>> -4: cmpwi 0,r4,1
>> - bne+ 5f
>> - lbz r5,4(r3)
>> - slwi r5,r5,8 /* Upper byte of word */
>> - adde r0,r0,r5
>> -5: addze r3,r0 /* add in final carry */
>> + adde r5,r5,r0
>> +4: andi. r0,r4,1
>> + beq+ 5f
>> + lbz r0,4(r3)
>> + slwi r0,r0,8 /* Upper byte of word */
>> + adde r5,r5,r0
>> +5: addze r3,r5 /* add in final carry */
>> blr
>>
>> /*
* Re: [4/9] powerpc: inline ip_fast_csum()
2015-09-22 14:34 ` [PATCH 4/9] powerpc: inline ip_fast_csum() Christophe Leroy
2015-09-23 5:43 ` Denis Kirjanov
@ 2016-03-05 3:50 ` Scott Wood
1 sibling, 0 replies; 22+ messages in thread
From: Scott Wood @ 2016-03-05 3:50 UTC (permalink / raw)
To: Christophe Leroy
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
scottwood, netdev, linuxppc-dev, linux-kernel
On Tue, Sep 22, 2015 at 04:34:25PM +0200, Christophe Leroy wrote:
> @@ -137,6 +130,45 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
> #endif
> }
>
> +/*
> + * This is a version of ip_compute_csum() optimized for IP headers,
> + * which always checksum on 4 octet boundaries. ihl is the number
> + * of 32-bit words and is always >= 5.
> + */
> +static inline __wsum ip_fast_csum_nofold(const void *iph, unsigned int ihl)
> +{
> + u32 *ptr = (u32 *)iph + 1;
const?
> +#ifdef __powerpc64__
> + unsigned int i;
> + u64 s = *(__force u32 *)iph;
const?
Why __force?
> + s += (s >> 32);
> + return (__force __wsum)s;
> +
> +#else
> + __wsum sum, tmp;
> +
> + asm("mtctr %3;"
> + "addc %0,%4,%5;"
> + "1:lwzu %1, 4(%2);"
> + "adde %0,%0,%1;"
> + "bdnz 1b;"
> + "addze %0,%0;"
> + : "=r"(sum), "=r"(tmp), "+b"(ptr)
> + : "r"(ihl - 2), "r"(*(u32 *)iph), "r"(*ptr)
> + : "ctr", "xer", "memory");
Space between " and (
Space after :
const in cast
I've fixed these up while applying.
-Scott
* Re: [9/9] powerpc: optimise csum_partial() call when len is constant
2015-09-22 14:34 ` [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant Christophe Leroy
2015-10-23 3:32 ` Scott Wood
@ 2016-03-05 5:29 ` Scott Wood
1 sibling, 0 replies; 22+ messages in thread
From: Scott Wood @ 2016-03-05 5:29 UTC (permalink / raw)
To: Christophe Leroy
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
scottwood, netdev, linuxppc-dev, linux-kernel
On Tue, Sep 22, 2015 at 04:34:36PM +0200, Christophe Leroy wrote:
> +/*
> + * computes the checksum of a memory block at buff, length len,
> + * and adds in "sum" (32-bit)
> + *
> + * returns a 32-bit number suitable for feeding into itself
> + * or csum_tcpudp_magic
> + *
> + * this function must be called with even lengths, except
> + * for the last fragment, which may be odd
> + *
> + * it's best to have buff aligned on a 32-bit boundary
> + */
> +__wsum __csum_partial(const void *buff, int len, __wsum sum);
> +
> +static inline __wsum csum_partial(const void *buff, int len, __wsum sum)
> +{
> + if (__builtin_constant_p(len) && len == 0)
> + return sum;
> +
> + if (__builtin_constant_p(len) && len <= 16 && (len & 1) == 0) {
> + __wsum sum1;
> +
> + if (len == 2)
> + sum1 = (__force u32)*(u16 *)buff;
> + if (len >= 4)
> + sum1 = *(u32 *)buff;
> + if (len == 6)
> + sum1 = csum_add(sum1, (__force u32)*(u16 *)(buff + 4));
> + if (len >= 8)
> + sum1 = csum_add(sum1, *(u32 *)(buff + 4));
> + if (len == 10)
> + sum1 = csum_add(sum1, (__force u32)*(u16 *)(buff + 8));
> + if (len >= 12)
> + sum1 = csum_add(sum1, *(u32 *)(buff + 8));
> + if (len == 14)
> + sum1 = csum_add(sum1, (__force u32)*(u16 *)(buff + 12));
> + if (len >= 16)
> + sum1 = csum_add(sum1, *(u32 *)(buff + 12));
> +
> + sum = csum_add(sum1, sum);
Why the final csum_add instead of s/sum1/sum/ and putting csum_add in the
"len == 2" and "len >= 4" cases?
The (__force u32) casts are unnecessary. Or rather, it should be
(__force __wsum) -- on all of them, not just the 16-bit ones.
The pointer casts should be const.
> + } else if (__builtin_constant_p(len) && (len & 3) == 0) {
> + sum = csum_add(ip_fast_csum_nofold(buff, len >> 2), sum);
It may not make a functional difference, but based on the csum_add()
argument names and other csum_add() usage, sum should come first
and the new content second.
-Scott
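Scott's question about the final csum_add() comes down to one's-complement addition being commutative and associative, so folding sum in first or last yields the same value. A small sketch for the len == 8 case (illustrative csum_add32(), not the kernel helper):

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit one's-complement add with end-around carry. */
static uint32_t csum_add32(uint32_t a, uint32_t b)
{
	uint64_t r = (uint64_t)a + b;
	return (uint32_t)r + (uint32_t)(r >> 32);
}

/* As in the patch: accumulate the two words into sum1, add sum at the end. */
static uint32_t len8_separate(uint32_t sum, uint32_t w0, uint32_t w1)
{
	uint32_t sum1 = csum_add32(w0, w1);

	return csum_add32(sum1, sum);
}

/* As Scott suggests: fold sum in from the start. */
static uint32_t len8_merged(uint32_t sum, uint32_t w0, uint32_t w1)
{
	sum = csum_add32(sum, w0);
	return csum_add32(sum, w1);
}
```

Both orderings carry the same value modulo 2^32 - 1 (the two encodings of zero, 0 and 0xffffffff, only differ in degenerate cases), so the choice is about readability and register pressure rather than correctness.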
end of thread, other threads:[~2016-03-05 5:29 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
2015-09-22 14:34 ` [PATCH 1/9] powerpc: unexport csum_tcpudp_magic Christophe Leroy
2015-09-22 14:34 ` [PATCH 2/9] powerpc: mark xer clobbered in csum_add() Christophe Leroy
2015-09-22 14:34 ` [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers Christophe Leroy
2015-10-23 3:26 ` Scott Wood
2015-10-28 11:11 ` Anton Blanchard
2015-09-22 14:34 ` [PATCH 4/9] powerpc: inline ip_fast_csum() Christophe Leroy
2015-09-23 5:43 ` Denis Kirjanov
2016-02-29 7:25 ` Christophe Leroy
2016-03-05 3:50 ` [4/9] " Scott Wood
2015-09-22 14:34 ` [PATCH 5/9] powerpc32: rewrite csum_partial_copy_generic() based on copy_tofrom_user() Christophe Leroy
2015-09-22 14:34 ` [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial() Christophe Leroy
2015-10-23 3:30 ` Scott Wood
2016-02-29 12:53 ` Christophe Leroy
2015-09-22 14:34 ` [PATCH 7/9] powerpc32: optimise csum_partial() loop Christophe Leroy
2015-09-22 14:34 ` [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0 Christophe Leroy
2015-10-23 3:33 ` Scott Wood
2016-02-29 7:26 ` Christophe Leroy
2015-09-22 14:34 ` [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant Christophe Leroy
2015-10-23 3:32 ` Scott Wood
2016-03-05 5:29 ` [9/9] " Scott Wood
2015-09-23 22:38 ` [PATCH 0/9] powerpc32: set of optimisation of network checksum functions David Miller