* [PATCH v4] arm64: lib: accelerate do_csum
@ 2019-11-06 2:20 Shaokun Zhang
2020-01-08 17:20 ` Will Deacon
0 siblings, 1 reply; 4+ messages in thread
From: Shaokun Zhang @ 2019-11-06 2:20 UTC (permalink / raw)
To: linux-arm-kernel
Cc: Ard Biesheuvel, Robin Murphy, Shaokun Zhang, Lingyan Huang,
Catalin Marinas, Will Deacon
From: Lingyan Huang <huanglingyan2@huawei.com>
Function do_csum() in lib/checksum.c is used to compute the checksum;
it turns out to be slow and to consume a lot of CPU time. Let's
accelerate the checksum computation for arm64.

We tested its performance on the Huawei Kunpeng 920 SoC; the results
are as follows:
 1cycle   general(ns)   csum_128(ns)   csum_64(ns)
   64B:       160             80            50
  256B:       120             70            60
 1023B:       350            140           150
 1024B:       350            130           140
 1500B:       470            170           180
 2048B:       630            210           240
 4095B:      1220            390           430
 4096B:      1230            390           430
Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Originally-from: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
---
Hi,
Apologies for posting this version so late; we wanted to optimise it
further. Lingyan tested its performance, which is attached in the
commit log. Both variants (128-bit and 64-bit strides) are much faster
than the original code.

ChangeLog:
- based on Robin's code, with the stride changed from 64 to 128 bits.
 arch/arm64/include/asm/checksum.h |  3 ++
 arch/arm64/lib/Makefile           |  2 +-
 arch/arm64/lib/csum.c             | 81 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/lib/csum.c
diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h
index d064a50deb5f..8d2a7de39744 100644
--- a/arch/arm64/include/asm/checksum.h
+++ b/arch/arm64/include/asm/checksum.h
@@ -35,6 +35,9 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 }
 #define ip_fast_csum ip_fast_csum
 
+extern unsigned int do_csum(const unsigned char *buff, int len);
+#define do_csum do_csum
+
 #include <asm-generic/checksum.h>
 
 #endif /* __ASM_CHECKSUM_H */
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index c21b936dc01d..8a0644a831eb 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -3,7 +3,7 @@ lib-y		:= clear_user.o delay.o copy_from_user.o	\
 		   copy_to_user.o copy_in_user.o copy_page.o	\
 		   clear_page.o memchr.o memcpy.o memmove.o memset.o	\
 		   memcmp.o strcmp.o strncmp.o strlen.o strnlen.o	\
-		   strchr.o strrchr.o tishift.o
+		   strchr.o strrchr.o tishift.o csum.o
 
 ifeq ($(CONFIG_KERNEL_MODE_NEON), y)
 obj-$(CONFIG_XOR_BLOCKS) += xor-neon.o
diff --git a/arch/arm64/lib/csum.c b/arch/arm64/lib/csum.c
new file mode 100644
index 000000000000..20170d8dcbc4
--- /dev/null
+++ b/arch/arm64/lib/csum.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright (C) 2019 Arm Ltd.
+
+#include <linux/compiler.h>
+#include <linux/kasan-checks.h>
+#include <linux/kernel.h>
+
+#include <net/checksum.h>
+
+
+/* handle overflow */
+static __uint128_t accumulate128(__uint128_t sum, __uint128_t data)
+{
+	sum += (sum >> 64) | (sum << 64);
+	data += (data >> 64) | (data << 64);
+	return (sum + data) >> 64;
+}
+
+unsigned int do_csum(const unsigned char *buff, int len)
+{
+	unsigned int offset, shift, sum, count;
+	__uint128_t data, *ptr;
+	__uint128_t sum128 = 0;
+	u64 sum64 = 0;
+
+	offset = (unsigned long)buff & 0xf;
+	/*
+	 * This is to all intents and purposes safe, since rounding down cannot
+	 * result in a different page or cache line being accessed, and @buff
+	 * should absolutely not be pointing to anything read-sensitive. We do,
+	 * however, have to be careful not to piss off KASAN, which means using
+	 * unchecked reads to accommodate the head and tail, for which we'll
+	 * compensate with an explicit check up-front.
+	 */
+	kasan_check_read(buff, len);
+	ptr = (__uint128_t *)(buff - offset);
+	shift = offset * 8;
+
+	/*
+	 * Head: zero out any excess leading bytes. Shifting back by the same
+	 * amount should be at least as fast as any other way of handling the
+	 * odd/even alignment, and means we can ignore it until the very end.
+	 */
+	data = READ_ONCE_NOCHECK(*ptr++);
+#ifdef __LITTLE_ENDIAN
+	data = (data >> shift) << shift;
+#else
+	data = (data << shift) >> shift;
+#endif
+	count = 16 - offset;
+
+	/* Body: straightforward aligned loads from here on... */
+
+	while (len > count) {
+		sum128 = accumulate128(sum128, data);
+		data = READ_ONCE_NOCHECK(*ptr++);
+		count += 16;
+	}
+	/*
+	 * Tail: zero any over-read bytes similarly to the head, again
+	 * preserving odd/even alignment.
+	 */
+	shift = (count - len) * 8;
+#ifdef __LITTLE_ENDIAN
+	data = (data << shift) >> shift;
+#else
+	data = (data >> shift) << shift;
+#endif
+	sum128 = accumulate128(sum128, data);
+
+	/* Finally, folding */
+	sum128 += (sum128 >> 64) | (sum128 << 64);
+	sum64 = (sum128 >> 64);
+	sum64 += (sum64 >> 32) | (sum64 << 32);
+	sum = (sum64 >> 32);
+	sum += (sum >> 16) | (sum << 16);
+	if (offset & 1)
+		return (u16)swab32(sum);
+
+	return sum >> 16;
+}
--
2.7.4
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
* Re: [PATCH v4] arm64: lib: accelerate do_csum
2019-11-06 2:20 [PATCH v4] arm64: lib: accelerate do_csum Shaokun Zhang
@ 2020-01-08 17:20 ` Will Deacon
2020-01-11 8:09 ` Shaokun Zhang
0 siblings, 1 reply; 4+ messages in thread
From: Will Deacon @ 2020-01-08 17:20 UTC (permalink / raw)
To: Shaokun Zhang, robin.murphy
Cc: Lingyan Huang, Ard Biesheuvel, linux-arm-kernel, Catalin Marinas
On Wed, Nov 06, 2019 at 10:20:06AM +0800, Shaokun Zhang wrote:
> From: Lingyan Huang <huanglingyan2@huawei.com>
>
> Function do_csum() in lib/checksum.c is used to compute checksum,
> which is turned out to be slowly and costs a lot of resources.
> Let's accelerate the checksum computation for arm64.
>
> While we test its performance on Huawei Kunpeng 920 SoC, as follow:
> 1cycle general(ns) csum_128(ns) csum_64(ns)
> 64B: 160 80 50
> 256B: 120 70 60
> 1023B: 350 140 150
> 1024B: 350 130 140
> 1500B: 470 170 180
> 2048B: 630 210 240
> 4095B: 1220 390 430
> 4096B: 1230 390 430
>
> Cc: Will Deacon <will@kernel.org>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Originally-from: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
> ---
> Hi,
> Apologies that we post this version so later, because we want to
> optimise it better, Lingyan tested it performance which is attached
> in commit log. Both(128 and 64) are much better than the initial
> code.
> ChangeLog:
> based on Robin's code and change strides from 64 to 128.
>
> arch/arm64/include/asm/checksum.h | 3 ++
> arch/arm64/lib/Makefile | 2 +-
> arch/arm64/lib/csum.c | 81 +++++++++++++++++++++++++++++++++++++++
> 3 files changed, 85 insertions(+), 1 deletion(-)
> create mode 100644 arch/arm64/lib/csum.c
Robin -- any chance you could look at this please? If it's based on your
code then hopefully it's straightforward to review ;)
Will
* Re: [PATCH v4] arm64: lib: accelerate do_csum
2020-01-08 17:20 ` Will Deacon
@ 2020-01-11 8:09 ` Shaokun Zhang
2020-01-14 12:18 ` Robin Murphy
0 siblings, 1 reply; 4+ messages in thread
From: Shaokun Zhang @ 2020-01-11 8:09 UTC (permalink / raw)
To: Will Deacon, robin.murphy
Cc: Lingyan Huang, Ard Biesheuvel, zhaoyuke, linux-arm-kernel,
Catalin Marinas
+Cc Yuke Zhang, who has used this patch and seen its gains while
debugging a performance issue.

Hi Will,

Thanks for reactivating this thread.
Robin, any comments are welcome; hopefully it can be merged into mainline.
Thanks,
Shaokun
On 2020/1/9 1:20, Will Deacon wrote:
> On Wed, Nov 06, 2019 at 10:20:06AM +0800, Shaokun Zhang wrote:
>> From: Lingyan Huang <huanglingyan2@huawei.com>
>>
>> Function do_csum() in lib/checksum.c is used to compute checksum,
>> which is turned out to be slowly and costs a lot of resources.
>> Let's accelerate the checksum computation for arm64.
>>
>> While we test its performance on Huawei Kunpeng 920 SoC, as follow:
>> 1cycle general(ns) csum_128(ns) csum_64(ns)
>> 64B: 160 80 50
>> 256B: 120 70 60
>> 1023B: 350 140 150
>> 1024B: 350 130 140
>> 1500B: 470 170 180
>> 2048B: 630 210 240
>> 4095B: 1220 390 430
>> 4096B: 1230 390 430
>>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Robin Murphy <robin.murphy@arm.com>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> Originally-from: Robin Murphy <robin.murphy@arm.com>
>> Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
>> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
>> ---
>> Hi,
>> Apologies that we post this version so later, because we want to
>> optimise it better, Lingyan tested it performance which is attached
>> in commit log. Both(128 and 64) are much better than the initial
>> code.
>> ChangeLog:
>> based on Robin's code and change strides from 64 to 128.
>>
>> arch/arm64/include/asm/checksum.h | 3 ++
>> arch/arm64/lib/Makefile | 2 +-
>> arch/arm64/lib/csum.c | 81 +++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 85 insertions(+), 1 deletion(-)
>> create mode 100644 arch/arm64/lib/csum.c
>
> Robin -- any chance you could look at this please? If it's based on your
> code then hopefully it's straightforward to review ;)
>
> Will
>
> .
>
* Re: [PATCH v4] arm64: lib: accelerate do_csum
2020-01-11 8:09 ` Shaokun Zhang
@ 2020-01-14 12:18 ` Robin Murphy
0 siblings, 0 replies; 4+ messages in thread
From: Robin Murphy @ 2020-01-14 12:18 UTC (permalink / raw)
To: Shaokun Zhang, Will Deacon
Cc: Lingyan Huang, Ard Biesheuvel, zhaoyuke, linux-arm-kernel,
Catalin Marinas
On 2020-01-11 8:09 am, Shaokun Zhang wrote:
> +Cc Yuke Zhang who has used this patch and enjoyed the patch's gain when debugged
> the performance issue.
>
> Hi Will,
>
> Thanks for reactivate this thread.
> Robin, any comments are welcome and hopefully it can be merged in mainline.
OK, I had a play with this yesterday, and somewhat surprisingly even
with a recent GCC it results in utterly dreadful code. I would always
have expected the head/tail alignment in __uint128_t arithmetic to be
ugly, and it certainly is, but even the "*ptr++" load in the main loop
comes out as this delightful nugget:
e3c: f8410502 ldr x2, [x8], #16
e40: f85f8103 ldur x3, [x8, #-8]
(Clang does at least manage to emit a post-indexed LDP there, but the
rest remains pretty awful)
Overall it ends up noticeably slower than even the generic code for
small buffers. I rigged up a crude userspace test to run the numbers
below - data is average call time in nanoseconds; "new" is the routine
from this patch; "new2/3/4" are loop-tuning variations of what I
came up with when I then went back to my WIP branch and finished off my
original idea. Once I've confirmed I got big-endian right I'll send out
another patch :)
Robin.
GCC 9.2.0:
----------

Cortex-A53
    size     generic      new     new2     new3     new4
       3:         20       35       22       22       24
       8:         34       35       22       22       24
      15:         36       35       29       23       25
      48:         69       45       38       38       39
      64:         80       50       49       44       44
     256:        217      117       99      110       92
    4096:       2908     1310     1146     1269      983
 1048576:     860430   461694   461694   493173   451201

Cortex-A72
    size     generic      new     new2     new3     new4
       3:          8       21       10        9       10
       8:         20       21       10        9       10
      15:         16       21       12       11       11
      48:         29       29       18       19       20
      64:         35       30       24       21       23
     256:        125       66       48       46       46
    4096:       1720      778      532      573      450
 1048576:     472187   272819   188874   220354   167888

Clang 9.0.1:
------------

Cortex-A53
    size     generic      new     new2     new3     new4
       3:         21       29       21       21       21
       8:         33       29       21       21       21
      15:         35       28       24       23       23
      48:         73       39       36       37       38
      64:         85       44       46       42       44
     256:        220      110      107      107       89
    4096:       2949     1310     1187     1310      942
 1048576:     849937   451201   472187   482680   451201

Cortex-A72
    size     generic      new     new2     new3     new4
       3:          8       16       10       10       10
       8:         23       16       10       10       10
      15:         16       16       12       12       12
      48:         27       21       18       20       20
      64:         31       24       24       22       23
     256:        125       53       48       63       46
    4096:       1720      655      573      860      532
 1048576:     472187   230847   209861   272819   188874
>
> Thanks,
> Shaokun
>
> On 2020/1/9 1:20, Will Deacon wrote:
>> On Wed, Nov 06, 2019 at 10:20:06AM +0800, Shaokun Zhang wrote:
>>> From: Lingyan Huang <huanglingyan2@huawei.com>
>>>
>>> Function do_csum() in lib/checksum.c is used to compute checksum,
>>> which is turned out to be slowly and costs a lot of resources.
>>> Let's accelerate the checksum computation for arm64.
>>>
>>> While we test its performance on Huawei Kunpeng 920 SoC, as follow:
>>> 1cycle general(ns) csum_128(ns) csum_64(ns)
>>> 64B: 160 80 50
>>> 256B: 120 70 60
>>> 1023B: 350 140 150
>>> 1024B: 350 130 140
>>> 1500B: 470 170 180
>>> 2048B: 630 210 240
>>> 4095B: 1220 390 430
>>> 4096B: 1230 390 430
>>>
>>> Cc: Will Deacon <will@kernel.org>
>>> Cc: Robin Murphy <robin.murphy@arm.com>
>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>>> Originally-from: Robin Murphy <robin.murphy@arm.com>
>>> Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
>>> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
>>> ---
>>> Hi,
>>> Apologies that we post this version so later, because we want to
>>> optimise it better, Lingyan tested it performance which is attached
>>> in commit log. Both(128 and 64) are much better than the initial
>>> code.
>>> ChangeLog:
>>> based on Robin's code and change strides from 64 to 128.
>>>
>>> arch/arm64/include/asm/checksum.h | 3 ++
>>> arch/arm64/lib/Makefile | 2 +-
>>> arch/arm64/lib/csum.c | 81 +++++++++++++++++++++++++++++++++++++++
>>> 3 files changed, 85 insertions(+), 1 deletion(-)
>>> create mode 100644 arch/arm64/lib/csum.c
>>
>> Robin -- any chance you could look at this please? If it's based on your
>> code then hopefully it's straightforward to review ;)
>>
>> Will
>>
>> .
>>
>