linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v4] arm64: lib: accelerate do_csum
@ 2019-11-06  2:20 Shaokun Zhang
  2020-01-08 17:20 ` Will Deacon
  0 siblings, 1 reply; 4+ messages in thread
From: Shaokun Zhang @ 2019-11-06  2:20 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Ard Biesheuvel, Robin Murphy, Shaokun Zhang, Lingyan Huang,
	Catalin Marinas, Will Deacon

From: Lingyan Huang <huanglingyan2@huawei.com>

Function do_csum() in lib/checksum.c is used to compute the checksum,
and it turns out to be slow and to cost a lot of resources.
Let's accelerate the checksum computation for arm64.

We tested its performance on the Huawei Kunpeng 920 SoC, as follows:
 1cycle  general(ns)  csum_128(ns) csum_64(ns)
  64B:        160            80             50
 256B:        120            70             60
1023B:        350           140            150
1024B:        350           130            140
1500B:        470           170            180
2048B:        630           210            240
4095B:       1220           390            430
4096B:       1230           390            430

Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Originally-from: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
---
Hi,
Apologies for posting this version so late; we wanted to optimise it
further. Lingyan measured its performance, which is included in the
commit log. Both variants (128-bit and 64-bit) are much better than
the original code.
ChangeLog:
    based on Robin's code, with the stride changed from 64 to 128 bits.

 arch/arm64/include/asm/checksum.h |  3 ++
 arch/arm64/lib/Makefile           |  2 +-
 arch/arm64/lib/csum.c             | 81 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/lib/csum.c

diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h
index d064a50deb5f..8d2a7de39744 100644
--- a/arch/arm64/include/asm/checksum.h
+++ b/arch/arm64/include/asm/checksum.h
@@ -35,6 +35,9 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 }
 #define ip_fast_csum ip_fast_csum
 
+extern unsigned int do_csum(const unsigned char *buff, int len);
+#define do_csum do_csum
+
 #include <asm-generic/checksum.h>
 
 #endif	/* __ASM_CHECKSUM_H */
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index c21b936dc01d..8a0644a831eb 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -3,7 +3,7 @@ lib-y		:= clear_user.o delay.o copy_from_user.o		\
 		   copy_to_user.o copy_in_user.o copy_page.o		\
 		   clear_page.o memchr.o memcpy.o memmove.o memset.o	\
 		   memcmp.o strcmp.o strncmp.o strlen.o strnlen.o	\
-		   strchr.o strrchr.o tishift.o
+		   strchr.o strrchr.o tishift.o csum.o
 
 ifeq ($(CONFIG_KERNEL_MODE_NEON), y)
 obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
diff --git a/arch/arm64/lib/csum.c b/arch/arm64/lib/csum.c
new file mode 100644
index 000000000000..20170d8dcbc4
--- /dev/null
+++ b/arch/arm64/lib/csum.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright (C) 2019 Arm Ltd.
+
+#include <linux/compiler.h>
+#include <linux/kasan-checks.h>
+#include <linux/kernel.h>
+
+#include <net/checksum.h>
+
+
+/* handle overflow */
+static __uint128_t accumulate128(__uint128_t sum, __uint128_t data)
+{
+	sum  += (sum  >> 64) | (sum  << 64);
+	data += (data >> 64) | (data << 64);
+	return (sum + data) >> 64;
+}
+
+unsigned int do_csum(const unsigned char *buff, int len)
+{
+	unsigned int offset, shift, sum, count;
+	__uint128_t data, *ptr;
+	__uint128_t sum128 = 0;
+	u64 sum64 = 0;
+
+	offset = (unsigned long)buff & 0xf;
+	/*
+	 * This is to all intents and purposes safe, since rounding down cannot
+	 * result in a different page or cache line being accessed, and @buff
+	 * should absolutely not be pointing to anything read-sensitive. We do,
+	 * however, have to be careful not to piss off KASAN, which means using
+	 * unchecked reads to accommodate the head and tail, for which we'll
+	 * compensate with an explicit check up-front.
+	 */
+	kasan_check_read(buff, len);
+	ptr = (__uint128_t *)(buff - offset);
+	shift = offset * 8;
+
+	/*
+	 * Head: zero out any excess leading bytes. Shifting back by the same
+	 * amount should be at least as fast as any other way of handling the
+	 * odd/even alignment, and means we can ignore it until the very end.
+	 */
+	data = READ_ONCE_NOCHECK(*ptr++);
+#ifdef __LITTLE_ENDIAN
+	data = (data >> shift) << shift;
+#else
+	data = (data << shift) >> shift;
+#endif
+	count = 16 - offset;
+
+	/* Body: straightforward aligned loads from here on... */
+
+	while (len > count) {
+		sum128 = accumulate128(sum128, data);
+		data = READ_ONCE_NOCHECK(*ptr++);
+		count += 16;
+	}
+	/*
+	 * Tail: zero any over-read bytes similarly to the head, again
+	 * preserving odd/even alignment.
+	 */
+	shift = (count - len) * 8;
+#ifdef __LITTLE_ENDIAN
+	data = (data << shift) >> shift;
+#else
+	data = (data >> shift) << shift;
+#endif
+	sum128 = accumulate128(sum128, data);
+
+	/* Finally, folding */
+	sum128 += (sum128 >> 64) | (sum128 << 64);
+	sum64 = (sum128 >> 64);
+	sum64 += (sum64 >> 32) | (sum64 << 32);
+	sum = (sum64 >> 32);
+	sum += (sum >> 16) | (sum << 16);
+	if (offset & 1)
+		return (u16)swab32(sum);
+
+	return sum >> 16;
+}
-- 
2.7.4
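
The trick the patch relies on is that ones'-complement addition lets carries
be deferred: accumulate128() folds overflow back in with a rotate-and-add,
and the tail of do_csum() then folds 128 bits down through 64 and 32 to a
16-bit result in the same end-around-carry fashion. A minimal stand-alone
sketch of that property is below; it is plain host-order C for illustration
only (the file name is hypothetical), not the kernel code and not part of
the patch, and it deliberately omits the odd-offset byte swap handled above.

/*
 * fold-demo.c - illustrative only; shows why ones'-complement
 * checksumming lets carries be deferred. Not the kernel code.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* End-around-carry fold: reduce a wide accumulator to 16 bits. */
static uint16_t fold16(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

int main(void)
{
	unsigned char buf[64];
	uint64_t deferred = 0;	/* carries pile up in the high bits */
	uint32_t eager = 0;	/* folded back down after every word */
	size_t i;

	for (i = 0; i < sizeof(buf); i++)
		buf[i] = (unsigned char)(i * 37 + 11);

	for (i = 0; i < sizeof(buf); i += 2) {
		uint16_t w;

		memcpy(&w, buf + i, sizeof(w));	/* host-order 16-bit word */
		deferred += w;
		eager = fold16((uint64_t)eager + w);
	}

	/* Deferring the folds changes nothing modulo 0xffff. */
	assert(fold16(deferred) % 0xffff == eager % 0xffff);
	printf("checksum = 0x%04x\n", fold16(deferred));
	return 0;
}

Folding eagerly after every word and folding once at the end agree modulo
0xffff, which is why do_csum() can keep a wide accumulator and only fold
when it is finished.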


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* Re: [PATCH v4] arm64: lib: accelerate do_csum
  2019-11-06  2:20 [PATCH v4] arm64: lib: accelerate do_csum Shaokun Zhang
@ 2020-01-08 17:20 ` Will Deacon
  2020-01-11  8:09   ` Shaokun Zhang
  0 siblings, 1 reply; 4+ messages in thread
From: Will Deacon @ 2020-01-08 17:20 UTC (permalink / raw)
  To: Shaokun Zhang, robin.murphy
  Cc: Lingyan Huang, Ard Biesheuvel, linux-arm-kernel, Catalin Marinas

On Wed, Nov 06, 2019 at 10:20:06AM +0800, Shaokun Zhang wrote:
> From: Lingyan Huang <huanglingyan2@huawei.com>
> 
> Function do_csum() in lib/checksum.c is used to compute the checksum,
> and it turns out to be slow and to cost a lot of resources.
> Let's accelerate the checksum computation for arm64.
> 
> We tested its performance on the Huawei Kunpeng 920 SoC, as follows:
>  1cycle  general(ns)  csum_128(ns) csum_64(ns)
>   64B:        160            80             50
>  256B:        120            70             60
> 1023B:        350           140            150
> 1024B:        350           130            140
> 1500B:        470           170            180
> 2048B:        630           210            240
> 4095B:       1220           390            430
> 4096B:       1230           390            430
> 
> Cc: Will Deacon <will@kernel.org>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Originally-from: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
> ---
> Hi,
> Apologies for posting this version so late; we wanted to optimise it
> further. Lingyan measured its performance, which is included in the
> commit log. Both variants (128-bit and 64-bit) are much better than
> the original code.
> ChangeLog:
>     based on Robin's code, with the stride changed from 64 to 128 bits.
> 
>  arch/arm64/include/asm/checksum.h |  3 ++
>  arch/arm64/lib/Makefile           |  2 +-
>  arch/arm64/lib/csum.c             | 81 +++++++++++++++++++++++++++++++++++++++
>  3 files changed, 85 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/lib/csum.c

Robin -- any chance you could look at this please? If it's based on your
code then hopefully it's straightforward to review ;)

Will


* Re: [PATCH v4] arm64: lib: accelerate do_csum
  2020-01-08 17:20 ` Will Deacon
@ 2020-01-11  8:09   ` Shaokun Zhang
  2020-01-14 12:18     ` Robin Murphy
  0 siblings, 1 reply; 4+ messages in thread
From: Shaokun Zhang @ 2020-01-11  8:09 UTC (permalink / raw)
  To: Will Deacon, robin.murphy
  Cc: Lingyan Huang, Ard Biesheuvel, zhaoyuke, linux-arm-kernel,
	Catalin Marinas

+Cc Yuke Zhang, who has used this patch and benefited from its gains while
debugging a performance issue.

Hi Will,

Thanks for reviving this thread.
Robin, any comments are welcome; hopefully this can be merged into mainline.

Thanks,
Shaokun

On 2020/1/9 1:20, Will Deacon wrote:
> On Wed, Nov 06, 2019 at 10:20:06AM +0800, Shaokun Zhang wrote:
>> From: Lingyan Huang <huanglingyan2@huawei.com>
>>
>> Function do_csum() in lib/checksum.c is used to compute the checksum,
>> and it turns out to be slow and to cost a lot of resources.
>> Let's accelerate the checksum computation for arm64.
>>
>> We tested its performance on the Huawei Kunpeng 920 SoC, as follows:
>>  1cycle  general(ns)  csum_128(ns) csum_64(ns)
>>   64B:        160            80             50
>>  256B:        120            70             60
>> 1023B:        350           140            150
>> 1024B:        350           130            140
>> 1500B:        470           170            180
>> 2048B:        630           210            240
>> 4095B:       1220           390            430
>> 4096B:       1230           390            430
>>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Robin Murphy <robin.murphy@arm.com>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> Originally-from: Robin Murphy <robin.murphy@arm.com>
>> Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
>> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
>> ---
>> Hi,
>> Apologies for posting this version so late; we wanted to optimise it
>> further. Lingyan measured its performance, which is included in the
>> commit log. Both variants (128-bit and 64-bit) are much better than
>> the original code.
>> ChangeLog:
>>     based on Robin's code, with the stride changed from 64 to 128 bits.
>>
>>  arch/arm64/include/asm/checksum.h |  3 ++
>>  arch/arm64/lib/Makefile           |  2 +-
>>  arch/arm64/lib/csum.c             | 81 +++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 85 insertions(+), 1 deletion(-)
>>  create mode 100644 arch/arm64/lib/csum.c
> 
> Robin -- any chance you could look at this please? If it's based on your
> code then hopefully it's straightforward to review ;)
> 
> Will
> 
> .
> 



* Re: [PATCH v4] arm64: lib: accelerate do_csum
  2020-01-11  8:09   ` Shaokun Zhang
@ 2020-01-14 12:18     ` Robin Murphy
  0 siblings, 0 replies; 4+ messages in thread
From: Robin Murphy @ 2020-01-14 12:18 UTC (permalink / raw)
  To: Shaokun Zhang, Will Deacon
  Cc: Lingyan Huang, Ard Biesheuvel, zhaoyuke, linux-arm-kernel,
	Catalin Marinas

On 2020-01-11 8:09 am, Shaokun Zhang wrote:
> +Cc Yuke Zhang, who has used this patch and benefited from its gains while
> debugging a performance issue.
> 
> Hi Will,
> 
> Thanks for reviving this thread.
> Robin, any comments are welcome; hopefully this can be merged into mainline.

OK, I had a play with this yesterday, and somewhat surprisingly even 
with a recent GCC it results in utterly dreadful code. I would always 
have expected the head/tail alignment in __uint128_t arithmetic to be 
ugly, and it certainly is, but even the "*ptr++" load in the main loop 
comes out as this delightful nugget:

      e3c:       f8410502        ldr     x2, [x8], #16
      e40:       f85f8103        ldur    x3, [x8, #-8]

(Clang does at least manage to emit a post-indexed LDP there, but the 
rest remains pretty awful)
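
For anyone wanting to poke at this outside the kernel, a reduced probe along
the following lines (the file name and build commands are a suggested setup,
not Robin's actual test) can be built with an aarch64 GCC at -O2 and
disassembled to see how the 16-byte post-incremented load is lowered;
whether it reproduces the exact ldr/ldur pair will depend on the compiler
version and the surrounding code.

/* probe128.c - hypothetical reduction of the loop's 16-byte load */
#include <stddef.h>

unsigned __int128 sum128(const unsigned __int128 *p, size_t n)
{
	unsigned __int128 s = 0;

	/* The "*p++" here mirrors the "*ptr++" load in do_csum()'s body. */
	while (n--)
		s += *p++;
	return s;
}

e.g.  aarch64-linux-gnu-gcc -O2 -c probe128.c &&
      aarch64-linux-gnu-objdump -d probe128.o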

Overall it ends up noticeably slower than even the generic code for 
small buffers. I rigged up a crude userspace test to run the numbers 
below - data is average call time in nanoseconds; "new" is the routine 
from this patch, "new2/3/4" are loop-tuning variations of what I 
came up with when I then went back to my WIP branch and finished off my 
original idea. Once I've confirmed I got big-endian right I'll send out 
another patch :)

Robin.


GCC 9.2.0:
---------
Cortex-A53
size            generic new     new2    new3    new4
        3:       20      35      22      22      24
        8:       34      35      22      22      24
       15:       36      35      29      23      25
       48:       69      45      38      38      39
       64:       80      50      49      44      44
      256:       217     117     99      110     92
     4096:       2908    1310    1146    1269    983
  1048576:       860430  461694  461694  493173  451201
Cortex-A72
size            generic new     new2    new3    new4
        3:       8       21      10      9       10
        8:       20      21      10      9       10
       15:       16      21      12      11      11
       48:       29      29      18      19      20
       64:       35      30      24      21      23
      256:       125     66      48      46      46
     4096:       1720    778     532     573     450
  1048576:       472187  272819  188874  220354  167888

Clang 9.0.1:
-----------
Cortex-A53
size            generic new     new2    new3    new4
        3:       21      29      21      21      21
        8:       33      29      21      21      21
       15:       35      28      24      23      23
       48:       73      39      36      37      38
       64:       85      44      46      42      44
      256:       220     110     107     107     89
     4096:       2949    1310    1187    1310    942
  1048576:       849937  451201  472187  482680  451201
Cortex-A72
size            generic new     new2    new3    new4
        3:       8       16      10      10      10
        8:       23      16      10      10      10
       15:       16      16      12      12      12
       48:       27      21      18      20      20
       64:       31      24      24      22      23
      256:       125     53      48      63      46
     4096:       1720    655     573     860     532
  1048576:       472187  230847  209861  272819  188874
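
For reference, a rough sketch of the kind of crude userspace harness
described above might look like the following. The sizes, iteration count
and the trivial csum_ref() stand-in are illustrative assumptions, not the
harness that produced the numbers; in a real comparison you would time
do_csum() from the patch (and the generic lib/checksum.c routine) in place
of csum_ref().

/* csum-bench.c - hypothetical timing harness, not the one used above */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Trivial stand-in for the routine under test. */
static unsigned int csum_ref(const unsigned char *buff, int len)
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i + 1 < len; i += 2) {
		uint16_t w;

		memcpy(&w, buff + i, sizeof(w));
		sum += w;
	}
	if (len & 1)
		sum += buff[len - 1];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (unsigned int)sum;
}

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
	static const int sizes[] = { 3, 8, 15, 48, 64, 256, 4096, 1048576 };
	unsigned char *buf = malloc(1048576 + 16);
	volatile unsigned int sink = 0;
	size_t i;

	if (!buf)
		return 1;
	for (i = 0; i < 1048576 + 16; i++)
		buf[i] = (unsigned char)rand();

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		const int iters = 10000;
		uint64_t t0, t1;
		int j;

		t0 = now_ns();
		for (j = 0; j < iters; j++)
			/* odd offset to exercise the misaligned head/tail */
			sink += csum_ref(buf + 1, sizes[i]);
		t1 = now_ns();
		printf("%8d: %llu ns/call\n", sizes[i],
		       (unsigned long long)((t1 - t0) / iters));
	}

	(void)sink;
	free(buf);
	return 0;
}

Built natively on the target (e.g. gcc -O2 csum-bench.c), it prints an
average ns-per-call figure for each size, comparable in spirit to the
tables above.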

> 
> Thanks,
> Shaokun
> 
> On 2020/1/9 1:20, Will Deacon wrote:
>> On Wed, Nov 06, 2019 at 10:20:06AM +0800, Shaokun Zhang wrote:
>>> From: Lingyan Huang <huanglingyan2@huawei.com>
>>>
>>> Function do_csum() in lib/checksum.c is used to compute the checksum,
>>> and it turns out to be slow and to cost a lot of resources.
>>> Let's accelerate the checksum computation for arm64.
>>>
>>> We tested its performance on the Huawei Kunpeng 920 SoC, as follows:
>>>   1cycle  general(ns)  csum_128(ns) csum_64(ns)
>>>    64B:        160            80             50
>>>   256B:        120            70             60
>>> 1023B:        350           140            150
>>> 1024B:        350           130            140
>>> 1500B:        470           170            180
>>> 2048B:        630           210            240
>>> 4095B:       1220           390            430
>>> 4096B:       1230           390            430
>>>
>>> Cc: Will Deacon <will@kernel.org>
>>> Cc: Robin Murphy <robin.murphy@arm.com>
>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>>> Originally-from: Robin Murphy <robin.murphy@arm.com>
>>> Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
>>> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
>>> ---
>>> Hi,
>>> Apologies for posting this version so late; we wanted to optimise it
>>> further. Lingyan measured its performance, which is included in the
>>> commit log. Both variants (128-bit and 64-bit) are much better than
>>> the original code.
>>> ChangeLog:
>>>      based on Robin's code, with the stride changed from 64 to 128 bits.
>>>
>>>   arch/arm64/include/asm/checksum.h |  3 ++
>>>   arch/arm64/lib/Makefile           |  2 +-
>>>   arch/arm64/lib/csum.c             | 81 +++++++++++++++++++++++++++++++++++++++
>>>   3 files changed, 85 insertions(+), 1 deletion(-)
>>>   create mode 100644 arch/arm64/lib/csum.c
>>
>> Robin -- any chance you could look at this please? If it's based on your
>> code then hopefully it's straightforward to review ;)
>>
>> Will
>>
>> .
>>
> 


end of thread, other threads:[~2020-01-14 12:19 UTC | newest]

Thread overview: 4+ messages
2019-11-06  2:20 [PATCH v4] arm64: lib: accelerate do_csum Shaokun Zhang
2020-01-08 17:20 ` Will Deacon
2020-01-11  8:09   ` Shaokun Zhang
2020-01-14 12:18     ` Robin Murphy
