* [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly
@ 2018-05-22 6:57 Christophe Leroy
2018-05-23 18:34 ` Segher Boessenkool
0 siblings, 1 reply; 6+ messages in thread
From: Christophe Leroy @ 2018-05-22 6:57 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, segher
Cc: linux-kernel, linuxppc-dev, netdev
The generic csum_ipv6_magic() generates a pretty bad result
00000000 <csum_ipv6_magic>: (PPC32)
0: 81 23 00 00 lwz r9,0(r3)
4: 81 03 00 04 lwz r8,4(r3)
8: 7c e7 4a 14 add r7,r7,r9
c: 7d 29 38 10 subfc r9,r9,r7
10: 7d 4a 51 10 subfe r10,r10,r10
14: 7d 27 42 14 add r9,r7,r8
18: 7d 2a 48 50 subf r9,r10,r9
1c: 80 e3 00 08 lwz r7,8(r3)
20: 7d 08 48 10 subfc r8,r8,r9
24: 7d 4a 51 10 subfe r10,r10,r10
28: 7d 29 3a 14 add r9,r9,r7
2c: 81 03 00 0c lwz r8,12(r3)
30: 7d 2a 48 50 subf r9,r10,r9
34: 7c e7 48 10 subfc r7,r7,r9
38: 7d 4a 51 10 subfe r10,r10,r10
3c: 7d 29 42 14 add r9,r9,r8
40: 7d 2a 48 50 subf r9,r10,r9
44: 80 e4 00 00 lwz r7,0(r4)
48: 7d 08 48 10 subfc r8,r8,r9
4c: 7d 4a 51 10 subfe r10,r10,r10
50: 7d 29 3a 14 add r9,r9,r7
54: 7d 2a 48 50 subf r9,r10,r9
58: 81 04 00 04 lwz r8,4(r4)
5c: 7c e7 48 10 subfc r7,r7,r9
60: 7d 4a 51 10 subfe r10,r10,r10
64: 7d 29 42 14 add r9,r9,r8
68: 7d 2a 48 50 subf r9,r10,r9
6c: 80 e4 00 08 lwz r7,8(r4)
70: 7d 08 48 10 subfc r8,r8,r9
74: 7d 4a 51 10 subfe r10,r10,r10
78: 7d 29 3a 14 add r9,r9,r7
7c: 7d 2a 48 50 subf r9,r10,r9
80: 81 04 00 0c lwz r8,12(r4)
84: 7c e7 48 10 subfc r7,r7,r9
88: 7d 4a 51 10 subfe r10,r10,r10
8c: 7d 29 42 14 add r9,r9,r8
90: 7d 2a 48 50 subf r9,r10,r9
94: 7d 08 48 10 subfc r8,r8,r9
98: 7d 4a 51 10 subfe r10,r10,r10
9c: 7d 29 2a 14 add r9,r9,r5
a0: 7d 2a 48 50 subf r9,r10,r9
a4: 7c a5 48 10 subfc r5,r5,r9
a8: 7c 63 19 10 subfe r3,r3,r3
ac: 7d 29 32 14 add r9,r9,r6
b0: 7d 23 48 50 subf r9,r3,r9
b4: 7c c6 48 10 subfc r6,r6,r9
b8: 7c 63 19 10 subfe r3,r3,r3
bc: 7c 63 48 50 subf r3,r3,r9
c0: 54 6a 80 3e rotlwi r10,r3,16
c4: 7c 63 52 14 add r3,r3,r10
c8: 7c 63 18 f8 not r3,r3
cc: 54 63 84 3e rlwinm r3,r3,16,16,31
d0: 4e 80 00 20 blr
0000000000000000 <.csum_ipv6_magic>: (PPC64)
0: 81 23 00 00 lwz r9,0(r3)
4: 80 03 00 04 lwz r0,4(r3)
8: 81 63 00 08 lwz r11,8(r3)
c: 7c e7 4a 14 add r7,r7,r9
10: 7f 89 38 40 cmplw cr7,r9,r7
14: 7d 47 02 14 add r10,r7,r0
18: 7d 30 10 26 mfocrf r9,1
1c: 55 29 f7 fe rlwinm r9,r9,30,31,31
20: 7d 4a 4a 14 add r10,r10,r9
24: 7f 80 50 40 cmplw cr7,r0,r10
28: 7d 2a 5a 14 add r9,r10,r11
2c: 80 03 00 0c lwz r0,12(r3)
30: 81 44 00 00 lwz r10,0(r4)
34: 7d 10 10 26 mfocrf r8,1
38: 55 08 f7 fe rlwinm r8,r8,30,31,31
3c: 7d 29 42 14 add r9,r9,r8
40: 81 04 00 04 lwz r8,4(r4)
44: 7f 8b 48 40 cmplw cr7,r11,r9
48: 7d 29 02 14 add r9,r9,r0
4c: 7d 70 10 26 mfocrf r11,1
50: 55 6b f7 fe rlwinm r11,r11,30,31,31
54: 7d 29 5a 14 add r9,r9,r11
58: 7f 80 48 40 cmplw cr7,r0,r9
5c: 7d 29 52 14 add r9,r9,r10
60: 7c 10 10 26 mfocrf r0,1
64: 54 00 f7 fe rlwinm r0,r0,30,31,31
68: 7d 69 02 14 add r11,r9,r0
6c: 7f 8a 58 40 cmplw cr7,r10,r11
70: 7c 0b 42 14 add r0,r11,r8
74: 81 44 00 08 lwz r10,8(r4)
78: 7c f0 10 26 mfocrf r7,1
7c: 54 e7 f7 fe rlwinm r7,r7,30,31,31
80: 7c 00 3a 14 add r0,r0,r7
84: 7f 88 00 40 cmplw cr7,r8,r0
88: 7d 20 52 14 add r9,r0,r10
8c: 80 04 00 0c lwz r0,12(r4)
90: 7d 70 10 26 mfocrf r11,1
94: 55 6b f7 fe rlwinm r11,r11,30,31,31
98: 7d 29 5a 14 add r9,r9,r11
9c: 7f 8a 48 40 cmplw cr7,r10,r9
a0: 7d 29 02 14 add r9,r9,r0
a4: 7d 70 10 26 mfocrf r11,1
a8: 55 6b f7 fe rlwinm r11,r11,30,31,31
ac: 7d 29 5a 14 add r9,r9,r11
b0: 7f 80 48 40 cmplw cr7,r0,r9
b4: 7d 29 2a 14 add r9,r9,r5
b8: 7c 10 10 26 mfocrf r0,1
bc: 54 00 f7 fe rlwinm r0,r0,30,31,31
c0: 7d 29 02 14 add r9,r9,r0
c4: 7f 85 48 40 cmplw cr7,r5,r9
c8: 7c 09 32 14 add r0,r9,r6
cc: 7d 50 10 26 mfocrf r10,1
d0: 55 4a f7 fe rlwinm r10,r10,30,31,31
d4: 7c 00 52 14 add r0,r0,r10
d8: 7f 80 30 40 cmplw cr7,r0,r6
dc: 7d 30 10 26 mfocrf r9,1
e0: 55 29 ef fe rlwinm r9,r9,29,31,31
e4: 7c 09 02 14 add r0,r9,r0
e8: 54 03 80 3e rotlwi r3,r0,16
ec: 7c 03 02 14 add r0,r3,r0
f0: 7c 03 00 f8 not r3,r0
f4: 78 63 84 22 rldicl r3,r3,48,48
f8: 4e 80 00 20 blr
This patch implements it in assembly for both PPC32 and PPC64
Link: https://github.com/linuxppc/linux/issues/9
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
v3: Add support for PPC64 (please review, especially whether instructions order in optimal)
v2: Fix number of args in final addze
arch/powerpc/include/asm/checksum.h | 6 ++++++
arch/powerpc/lib/checksum_32.S | 33 +++++++++++++++++++++++++++++++++
arch/powerpc/lib/checksum_64.S | 28 ++++++++++++++++++++++++++++
3 files changed, 67 insertions(+)
diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index 54065caa40b3..a78a57e5058d 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -13,6 +13,7 @@
#include <asm-generic/checksum.h>
#else
#include <linux/bitops.h>
+#include <linux/in6.h>
/*
* Computes the checksum of a memory block at src, length len,
* and adds in "sum" (32-bit), while copying the block to dst.
@@ -211,6 +212,11 @@ static inline __sum16 ip_compute_csum(const void *buff, int len)
return csum_fold(csum_partial(buff, len, 0));
}
+#define _HAVE_ARCH_IPV6_CSUM
+__sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+ const struct in6_addr *daddr,
+ __u32 len, __u8 proto, __wsum sum);
+
#endif
#endif /* __KERNEL__ */
#endif
diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 9a671c774b22..9167ab088f04 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -293,3 +293,36 @@ dst_error:
EX_TABLE(51b, dst_error);
EXPORT_SYMBOL(csum_partial_copy_generic)
+
+/*
+ * static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+ * const struct in6_addr *daddr,
+ * __u32 len, __u8 proto, __wsum sum)
+ */
+
+_GLOBAL(csum_ipv6_magic)
+ lwz r8, 0(r3)
+ lwz r9, 4(r3)
+ lwz r10, 8(r3)
+ lwz r11, 12(r3)
+ addc r0, r5, r6
+ adde r0, r0, r7
+ adde r0, r0, r8
+ adde r0, r0, r9
+ adde r0, r0, r10
+ adde r0, r0, r11
+ lwz r8, 0(r4)
+ lwz r9, 4(r4)
+ lwz r10, 8(r4)
+ lwz r11, 12(r4)
+ adde r0, r0, r8
+ adde r0, r0, r9
+ adde r0, r0, r10
+ adde r0, r0, r11
+ addze r0, r0
+ rotlwi r3, r0, 16
+ add r3, r0, r3
+ not r3, r3
+ rlwinm r3, r3, 16, 16, 31
+ blr
+EXPORT_SYMBOL(csum_ipv6_magic)
diff --git a/arch/powerpc/lib/checksum_64.S b/arch/powerpc/lib/checksum_64.S
index d7f1a966136e..66900baf5600 100644
--- a/arch/powerpc/lib/checksum_64.S
+++ b/arch/powerpc/lib/checksum_64.S
@@ -429,3 +429,31 @@ dstnr; stb r6,0(r4)
stw r6,0(r8)
blr
EXPORT_SYMBOL(csum_partial_copy_generic)
+
+/*
+ * static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+ * const struct in6_addr *daddr,
+ * __u32 len, __u8 proto, __wsum sum)
+ */
+
+_GLOBAL(csum_ipv6_magic)
+ ld r8, 0(r3)
+ ld r9, 8(r3)
+ add r5, r5, r6
+ addc r0, r8, r9
+ ld r10, 0(r4)
+ ld r11, 8(r4)
+ adde r0, r0, r10
+ add r5, r5, r7
+ adde r0, r0, r11
+ adde r0, r0, r5
+ addze r0, r0
+ rotldi r3 ,r0, 32 /* fold two 32 bit halves together */
+ add r3, r0, r3
+ srdi r0, r3, 32
+ rotlwi r3, r0, 16 /* fold two 16 bit halves together */
+ add r3, r0, r3
+ not r3, r3
+ rlwinm r3, r3, 16, 16, 31
+ blr
+EXPORT_SYMBOL(csum_ipv6_magic)
--
2.13.3
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly
2018-05-22 6:57 [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly Christophe Leroy
@ 2018-05-23 18:34 ` Segher Boessenkool
2018-05-24 6:20 ` Christophe LEROY
0 siblings, 1 reply; 6+ messages in thread
From: Segher Boessenkool @ 2018-05-23 18:34 UTC (permalink / raw)
To: Christophe Leroy
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
linux-kernel, linuxppc-dev, netdev
On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
> The generic csum_ipv6_magic() generates a pretty bad result
<snip>
Please try with a more recent compiler, what you used is pretty ancient.
It's not like recent compilers do great on this either, but it's not
*that* bad anymore ;-)
> --- a/arch/powerpc/lib/checksum_32.S
> +++ b/arch/powerpc/lib/checksum_32.S
> @@ -293,3 +293,36 @@ dst_error:
> EX_TABLE(51b, dst_error);
>
> EXPORT_SYMBOL(csum_partial_copy_generic)
> +
> +/*
> + * static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> + * const struct in6_addr *daddr,
> + * __u32 len, __u8 proto, __wsum sum)
> + */
> +
> +_GLOBAL(csum_ipv6_magic)
> + lwz r8, 0(r3)
> + lwz r9, 4(r3)
> + lwz r10, 8(r3)
> + lwz r11, 12(r3)
> + addc r0, r5, r6
> + adde r0, r0, r7
> + adde r0, r0, r8
> + adde r0, r0, r9
> + adde r0, r0, r10
> + adde r0, r0, r11
> + lwz r8, 0(r4)
> + lwz r9, 4(r4)
> + lwz r10, 8(r4)
> + lwz r11, 12(r4)
> + adde r0, r0, r8
> + adde r0, r0, r9
> + adde r0, r0, r10
> + adde r0, r0, r11
> + addze r0, r0
> + rotlwi r3, r0, 16
> + add r3, r0, r3
> + not r3, r3
> + rlwinm r3, r3, 16, 16, 31
> + blr
> +EXPORT_SYMBOL(csum_ipv6_magic)
Clustering the loads and carry insns together is pretty much the worst you
can do on most 32-bit CPUs.
Segher
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly
2018-05-23 18:34 ` Segher Boessenkool
@ 2018-05-24 6:20 ` Christophe LEROY
2018-05-24 10:18 ` Christophe Leroy
2018-05-24 19:42 ` Segher Boessenkool
0 siblings, 2 replies; 6+ messages in thread
From: Christophe LEROY @ 2018-05-24 6:20 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
linux-kernel, linuxppc-dev, netdev
Le 23/05/2018 à 20:34, Segher Boessenkool a écrit :
> On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
>> The generic csum_ipv6_magic() generates a pretty bad result
>
> <snip>
>
> Please try with a more recent compiler, what you used is pretty ancient.
> It's not like recent compilers do great on this either, but it's not
> *that* bad anymore ;-)
>
>> --- a/arch/powerpc/lib/checksum_32.S
>> +++ b/arch/powerpc/lib/checksum_32.S
>> @@ -293,3 +293,36 @@ dst_error:
>> EX_TABLE(51b, dst_error);
>>
>> EXPORT_SYMBOL(csum_partial_copy_generic)
>> +
>> +/*
>> + * static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
>> + * const struct in6_addr *daddr,
>> + * __u32 len, __u8 proto, __wsum sum)
>> + */
>> +
>> +_GLOBAL(csum_ipv6_magic)
>> + lwz r8, 0(r3)
>> + lwz r9, 4(r3)
>> + lwz r10, 8(r3)
>> + lwz r11, 12(r3)
>> + addc r0, r5, r6
>> + adde r0, r0, r7
>> + adde r0, r0, r8
>> + adde r0, r0, r9
>> + adde r0, r0, r10
>> + adde r0, r0, r11
>> + lwz r8, 0(r4)
>> + lwz r9, 4(r4)
>> + lwz r10, 8(r4)
>> + lwz r11, 12(r4)
>> + adde r0, r0, r8
>> + adde r0, r0, r9
>> + adde r0, r0, r10
>> + adde r0, r0, r11
>> + addze r0, r0
>> + rotlwi r3, r0, 16
>> + add r3, r0, r3
>> + not r3, r3
>> + rlwinm r3, r3, 16, 16, 31
>> + blr
>> +EXPORT_SYMBOL(csum_ipv6_magic)
>
> Clustering the loads and carry insns together is pretty much the worst you
> can do on most 32-bit CPUs.
Oh, really ? __csum_partial is written that way too.
Right, now I tried interleaving the lwz and adde. I get no improvment at
all on a 885, but I get a 15% improvment on a 8321.
Christophe
>
>
> Segher
>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly
2018-05-24 6:20 ` Christophe LEROY
@ 2018-05-24 10:18 ` Christophe Leroy
2018-05-24 19:55 ` Segher Boessenkool
2018-05-24 19:42 ` Segher Boessenkool
1 sibling, 1 reply; 6+ messages in thread
From: Christophe Leroy @ 2018-05-24 10:18 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: linux-kernel, Paul Mackerras, netdev, linuxppc-dev
On 05/24/2018 06:20 AM, Christophe LEROY wrote:
>
>
> Le 23/05/2018 à 20:34, Segher Boessenkool a écrit :
>> On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
>>> The generic csum_ipv6_magic() generates a pretty bad result
>>
>> <snip>
>>
>> Please try with a more recent compiler, what you used is pretty ancient.
>> It's not like recent compilers do great on this either, but it's not
>> *that* bad anymore ;-)
Here is what I get with GCC 8.1
It doesn't look much better, does it ?
net/ipv6/ip6_checksum.o: file format elf32-powerpc
Disassembly of section .text:
00000000 <csum_ipv6_magic>:
0: 94 21 ff f0 stwu r1,-16(r1)
4: 80 04 00 00 lwz r0,0(r4)
8: 81 64 00 04 lwz r11,4(r4)
c: 81 04 00 08 lwz r8,8(r4)
10: 93 e1 00 0c stw r31,12(r1)
14: 81 43 00 00 lwz r10,0(r3)
18: 83 e3 00 04 lwz r31,4(r3)
1c: 81 23 00 08 lwz r9,8(r3)
20: 81 83 00 0c lwz r12,12(r3)
24: 7c ea 3a 14 add r7,r10,r7
28: 7d 4a 38 10 subfc r10,r10,r7
2c: 7c ff 3a 14 add r7,r31,r7
30: 81 44 00 0c lwz r10,12(r4)
34: 7c 63 19 10 subfe r3,r3,r3
38: 7c 63 38 50 subf r3,r3,r7
3c: 7f ff 18 10 subfc r31,r31,r3
40: 7c e9 1a 14 add r7,r9,r3
44: 83 e1 00 0c lwz r31,12(r1)
48: 7c 63 19 10 subfe r3,r3,r3
4c: 38 21 00 10 addi r1,r1,16
50: 7c 63 38 50 subf r3,r3,r7
54: 7d 29 18 10 subfc r9,r9,r3
58: 7d 2c 1a 14 add r9,r12,r3
5c: 7c 63 19 10 subfe r3,r3,r3
60: 7c 63 48 50 subf r3,r3,r9
64: 7d 8c 18 10 subfc r12,r12,r3
68: 7d 20 1a 14 add r9,r0,r3
6c: 7c 63 19 10 subfe r3,r3,r3
70: 7c 63 48 50 subf r3,r3,r9
74: 7c 00 18 10 subfc r0,r0,r3
78: 7d 2b 1a 14 add r9,r11,r3
7c: 7c 63 19 10 subfe r3,r3,r3
80: 7c 63 48 50 subf r3,r3,r9
84: 7d 6b 18 10 subfc r11,r11,r3
88: 7d 28 1a 14 add r9,r8,r3
8c: 7c 63 19 10 subfe r3,r3,r3
90: 7c 63 48 50 subf r3,r3,r9
94: 7d 08 18 10 subfc r8,r8,r3
98: 7d 2a 1a 14 add r9,r10,r3
9c: 7c 63 19 10 subfe r3,r3,r3
a0: 7c 63 48 50 subf r3,r3,r9
a4: 7d 4a 18 10 subfc r10,r10,r3
a8: 7d 23 2a 14 add r9,r3,r5
ac: 7c 63 19 10 subfe r3,r3,r3
b0: 7c 63 48 50 subf r3,r3,r9
b4: 7c a5 18 10 subfc r5,r5,r3
b8: 7c 63 32 14 add r3,r3,r6
bc: 7d 29 49 10 subfe r9,r9,r9
c0: 7d 29 18 50 subf r9,r9,r3
c4: 7c c6 48 10 subfc r6,r6,r9
c8: 7c 63 19 10 subfe r3,r3,r3
cc: 7c 63 48 50 subf r3,r3,r9
d0: 54 69 80 3e rotlwi r9,r3,16
d4: 7c 63 4a 14 add r3,r3,r9
d8: 7c 63 18 f8 not r3,r3
dc: 54 63 84 3e rlwinm r3,r3,16,16,31
e0: 4e 80 00 20 blr
net/ipv6/ip6_checksum.o: file format elf64-powerpc
Disassembly of section .text:
0000000000000000 <.csum_ipv6_magic>:
0: fb e1 ff f8 std r31,-8(r1)
4: 81 43 00 00 lwz r10,0(r3)
8: 81 83 00 04 lwz r12,4(r3)
c: 81 23 00 08 lwz r9,8(r3)
10: 80 03 00 0c lwz r0,12(r3)
14: 7c e7 52 14 add r7,r7,r10
18: 80 64 00 08 lwz r3,8(r4)
1c: 81 04 00 00 lwz r8,0(r4)
20: 78 ff 00 20 clrldi r31,r7,32
24: 7c ec 3a 14 add r7,r12,r7
28: 81 64 00 04 lwz r11,4(r4)
2c: 7f ea f8 50 subf r31,r10,r31
30: 81 44 00 0c lwz r10,12(r4)
34: 7b ff 0f e0 rldicl r31,r31,1,63
38: 7c ff 3a 14 add r7,r31,r7
3c: eb e1 ff f8 ld r31,-8(r1)
40: 78 e4 00 20 clrldi r4,r7,32
44: 7c e9 3a 14 add r7,r9,r7
48: 7d 8c 20 50 subf r12,r12,r4
4c: 79 8c 0f e0 rldicl r12,r12,1,63
50: 7d 8c 3a 14 add r12,r12,r7
54: 79 87 00 20 clrldi r7,r12,32
58: 7d 80 62 14 add r12,r0,r12
5c: 7d 29 38 50 subf r9,r9,r7
60: 79 29 0f e0 rldicl r9,r9,1,63
64: 7d 29 62 14 add r9,r9,r12
68: 79 27 00 20 clrldi r7,r9,32
6c: 7d 28 4a 14 add r9,r8,r9
70: 7c 00 38 50 subf r0,r0,r7
74: 78 00 0f e0 rldicl r0,r0,1,63
78: 7c 00 4a 14 add r0,r0,r9
7c: 78 09 00 20 clrldi r9,r0,32
80: 7c 0b 02 14 add r0,r11,r0
84: 7d 08 48 50 subf r8,r8,r9
88: 79 08 0f e0 rldicl r8,r8,1,63
8c: 7d 08 02 14 add r8,r8,r0
90: 79 09 00 20 clrldi r9,r8,32
94: 7d 03 42 14 add r8,r3,r8
98: 7d 2b 48 50 subf r9,r11,r9
9c: 79 29 0f e0 rldicl r9,r9,1,63
a0: 7d 29 42 14 add r9,r9,r8
a4: 79 28 00 20 clrldi r8,r9,32
a8: 7d 2a 4a 14 add r9,r10,r9
ac: 7d 03 40 50 subf r8,r3,r8
b0: 79 08 0f e0 rldicl r8,r8,1,63
b4: 7d 08 4a 14 add r8,r8,r9
b8: 79 09 00 20 clrldi r9,r8,32
bc: 7d 08 2a 14 add r8,r8,r5
c0: 7d 2a 48 50 subf r9,r10,r9
c4: 79 29 0f e0 rldicl r9,r9,1,63
c8: 7d 29 42 14 add r9,r9,r8
cc: 79 2a 00 20 clrldi r10,r9,32
d0: 7d 29 32 14 add r9,r9,r6
d4: 7c a5 50 50 subf r5,r5,r10
d8: 78 a5 0f e0 rldicl r5,r5,1,63
dc: 7d 25 4a 14 add r9,r5,r9
e0: 79 2a 00 20 clrldi r10,r9,32
e4: 7c c6 50 50 subf r6,r6,r10
e8: 78 c6 0f e0 rldicl r6,r6,1,63
ec: 7c c6 4a 14 add r6,r6,r9
f0: 54 c3 80 3e rotlwi r3,r6,16
f4: 7c c6 1a 14 add r6,r6,r3
f8: 7c c3 30 f8 not r3,r6
fc: 78 63 84 22 rldicl r3,r3,48,48
100: 4e 80 00 20 blr
Christophe
>>
>>> --- a/arch/powerpc/lib/checksum_32.S
>>> +++ b/arch/powerpc/lib/checksum_32.S
>>> @@ -293,3 +293,36 @@ dst_error:
>>> EX_TABLE(51b, dst_error);
>>> EXPORT_SYMBOL(csum_partial_copy_generic)
>>> +
>>> +/*
>>> + * static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
>>> + * const struct in6_addr *daddr,
>>> + * __u32 len, __u8 proto, __wsum sum)
>>> + */
>>> +
>>> +_GLOBAL(csum_ipv6_magic)
>>> + lwz r8, 0(r3)
>>> + lwz r9, 4(r3)
>>> + lwz r10, 8(r3)
>>> + lwz r11, 12(r3)
>>> + addc r0, r5, r6
>>> + adde r0, r0, r7
>>> + adde r0, r0, r8
>>> + adde r0, r0, r9
>>> + adde r0, r0, r10
>>> + adde r0, r0, r11
>>> + lwz r8, 0(r4)
>>> + lwz r9, 4(r4)
>>> + lwz r10, 8(r4)
>>> + lwz r11, 12(r4)
>>> + adde r0, r0, r8
>>> + adde r0, r0, r9
>>> + adde r0, r0, r10
>>> + adde r0, r0, r11
>>> + addze r0, r0
>>> + rotlwi r3, r0, 16
>>> + add r3, r0, r3
>>> + not r3, r3
>>> + rlwinm r3, r3, 16, 16, 31
>>> + blr
>>> +EXPORT_SYMBOL(csum_ipv6_magic)
>>
>> Clustering the loads and carry insns together is pretty much the worst
>> you
>> can do on most 32-bit CPUs.
>
> Oh, really ? __csum_partial is written that way too.
>
> Right, now I tried interleaving the lwz and adde. I get no improvment at
> all on a 885, but I get a 15% improvment on a 8321.
>
> Christophe
>
>>
>>
>> Segher
>>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly
2018-05-24 6:20 ` Christophe LEROY
2018-05-24 10:18 ` Christophe Leroy
@ 2018-05-24 19:42 ` Segher Boessenkool
1 sibling, 0 replies; 6+ messages in thread
From: Segher Boessenkool @ 2018-05-24 19:42 UTC (permalink / raw)
To: Christophe LEROY
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
linux-kernel, linuxppc-dev, netdev
On Thu, May 24, 2018 at 08:20:16AM +0200, Christophe LEROY wrote:
> Le 23/05/2018 à 20:34, Segher Boessenkool a écrit :
> >On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
> >>+_GLOBAL(csum_ipv6_magic)
> >>+ lwz r8, 0(r3)
> >>+ lwz r9, 4(r3)
> >>+ lwz r10, 8(r3)
> >>+ lwz r11, 12(r3)
> >>+ addc r0, r5, r6
> >>+ adde r0, r0, r7
> >>+ adde r0, r0, r8
> >>+ adde r0, r0, r9
> >>+ adde r0, r0, r10
> >>+ adde r0, r0, r11
> >>+ lwz r8, 0(r4)
> >>+ lwz r9, 4(r4)
> >>+ lwz r10, 8(r4)
> >>+ lwz r11, 12(r4)
> >>+ adde r0, r0, r8
> >>+ adde r0, r0, r9
> >>+ adde r0, r0, r10
> >>+ adde r0, r0, r11
> >>+ addze r0, r0
> >>+ rotlwi r3, r0, 16
> >>+ add r3, r0, r3
> >>+ not r3, r3
> >>+ rlwinm r3, r3, 16, 16, 31
> >>+ blr
> >>+EXPORT_SYMBOL(csum_ipv6_magic)
> >
> >Clustering the loads and carry insns together is pretty much the worst you
> >can do on most 32-bit CPUs.
>
> Oh, really ? __csum_partial is written that way too.
I thought I told you about this before? Maybe not.
> Right, now I tried interleaving the lwz and adde. I get no improvment at
> all on a 885, but I get a 15% improvment on a 8321.
It won't likely help on single-issue cores (like the one 885 has), yes.
Segher
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly
2018-05-24 10:18 ` Christophe Leroy
@ 2018-05-24 19:55 ` Segher Boessenkool
0 siblings, 0 replies; 6+ messages in thread
From: Segher Boessenkool @ 2018-05-24 19:55 UTC (permalink / raw)
To: Christophe Leroy; +Cc: linux-kernel, Paul Mackerras, netdev, linuxppc-dev
On Thu, May 24, 2018 at 10:18:44AM +0000, Christophe Leroy wrote:
> On 05/24/2018 06:20 AM, Christophe LEROY wrote:
> >Le 23/05/2018 à 20:34, Segher Boessenkool a écrit :
> >>On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
> >>>The generic csum_ipv6_magic() generates a pretty bad result
> >>
> >><snip>
> >>
> >>Please try with a more recent compiler, what you used is pretty ancient.
> >>It's not like recent compilers do great on this either, but it's not
> >>*that* bad anymore ;-)
>
> Here is what I get with GCC 8.1
> It doesn't look much better, does it ?
There are no more mfocrf, which is a big speedup. Other than that it is
pretty lousy still, I totally agree. This improvement happened quite a
while ago, it's fixed in GCC 6.
Segher
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2018-05-24 19:55 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-22 6:57 [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly Christophe Leroy
2018-05-23 18:34 ` Segher Boessenkool
2018-05-24 6:20 ` Christophe LEROY
2018-05-24 10:18 ` Christophe Leroy
2018-05-24 19:55 ` Segher Boessenkool
2018-05-24 19:42 ` Segher Boessenkool
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).