From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v3] arm64: lib: accelerate do_csum with NEON instruction
To: Dave Martin
References: <1546739729-17234-1-git-send-email-huanglingyan2@huawei.com>
 <20190109142513.GA3554@e103592.cambridge.arm.com>
From: "huanglingyan (A)"
Message-ID: <19a3acc0-87d7-5ca2-46a5-c3324797491d@huawei.com>
Date: Thu, 10 Jan 2019 16:03:59 +0800
In-Reply-To: <20190109142513.GA3554@e103592.cambridge.arm.com>
Cc: Catalin Marinas, Will Deacon, linux-arm-kernel@lists.infradead.org, Ard Biesheuvel

On 2019/1/9 22:58, Dave Martin wrote:
> On Sun, Jan 06, 2019 at 09:55:29AM +0800, Lingyan Huang wrote:
>> Function do_csum() in lib/checksum.c is used to compute the checksum,
>> which turns out to be slow and to cost a lot of resources.
>> Let's use NEON instructions to accelerate the checksum computation
>> for arm64.
>>
>> ------
>> V2 ==> V3:
>>    only modify the arm64 code instead of modifying headers
>>    under asm-generic and code in lib/checksum.c.
>> ------
>> ------
>> V1 ==> V2:
>>    Change NEON assembly code to NEON intrinsic code which is built
>>    on top of arm_neon.h to avoid dropping into assembly.
>> ------
>>
>> Here are the comparison results for ip_compute_csum() between the
>> generic do_csum() and the NEON do_csum(). The test platform is a
>> HUAWEI 1620 server with TAISHAN cores.
>>
>>     len (1000 cycles)    general (ns)    do_csum_neon (ns)
>>      64B:                   58060              59460
>>     128B:                   82930              83930
>>     256B:                  132480              73570
>>     512B:                  230100              86230
>>    1024B:                  426600              98200
>
> For testing purposes, you would need to cover all possible alignments of
> buff and all values of (len % 16), as well as testing on big- and
> little-endian.
>
> Otherwise, a lot of code for handling edge cases won't be tested.
>
> Also, it would be interesting to know which sizes are actually most
> common at runtime, to avoid falling into the trap of optimising for
> rare cases at the expense of the more common cases.
>
> You could try adding some instrumentation to collect some statistics on
> this.

Different alignments of buff and different lengths should indeed be shown
here; the test report will be more detailed, as you suggested (a coverage
sketch follows right below). As far as I know, the main use of do_csum()
is computing Internet packet checksums, where the packet length usually
ranges from 64 bytes up to the MTU (1500 bytes by default).

My test platform is little-endian. I will try my best to find a
big-endian test platform.
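To make that concrete, here is a minimal, untested userspace-style test
sketch covering every buff alignment and every value of (len % 16).
do_csum_generic() and do_csum_neon() are placeholder names standing for a
known-good reference implementation and the NEON version under test:

#include <stdio.h>
#include <stddef.h>

/* Placeholders: a known-good reference and the implementation under test. */
extern unsigned int do_csum_generic(const unsigned char *buff, int len);
extern unsigned int do_csum_neon(const unsigned char *buff, int len);

int main(void)
{
	static unsigned char pool[4096 + 16];
	unsigned int fail = 0;

	/* Non-uniform fill so lane or byte-order mistakes change the sum. */
	for (size_t i = 0; i < sizeof(pool); i++)
		pool[i] = (unsigned char)(i * 131 + 7);

	for (int align = 0; align < 16; align++) {	/* every buff alignment */
		for (int len = 1; len <= 4096; len++) {	/* covers all len % 16 */
			unsigned int want = do_csum_generic(pool + align, len);
			unsigned int got  = do_csum_neon(pool + align, len);

			if (want != got) {
				printf("FAIL align=%d len=%d want=%04x got=%04x\n",
				       align, len, want, got);
				fail++;
			}
		}
	}
	printf("%u mismatches\n", fail);
	return fail != 0;
}

Running the same harness on a big-endian build would exercise the other
half of the edge cases you mention.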
> >> Cc: Catalin Marinas >> Cc: Will Deacon >> Cc: Ard Biesheuvel >> Signed-off-by: Lingyan Huang >> --- >> arch/arm64/include/asm/checksum.h | 5 ++ >> arch/arm64/lib/Makefile | 8 +-- >> arch/arm64/lib/checksum.c | 26 ++++++++ >> arch/arm64/lib/do_csum.c | 136 ++++++++++++++++++++++++++++++++++++++ >> 4 files changed, 171 insertions(+), 4 deletions(-) >> create mode 100644 arch/arm64/lib/checksum.c >> create mode 100644 arch/arm64/lib/do_csum.c >> >> diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h >> index 0b6f5a7..7acd713 100644 >> --- a/arch/arm64/include/asm/checksum.h >> +++ b/arch/arm64/include/asm/checksum.h >> @@ -26,6 +26,10 @@ static inline __sum16 csum_fold(__wsum csum) >> } >> #define csum_fold csum_fold >> >> +#define do_csum do_csum >> +unsigned int do_csum(const unsigned char *buff, int len); >> +extern unsigned int do_csum_arm(const unsigned char *buff, int len); >> + >> static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl) >> { >> __uint128_t tmp; >> @@ -46,6 +50,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl) >> } >> #define ip_fast_csum ip_fast_csum >> >> + >> #include >> >> #endif /* __ASM_CHECKSUM_H */ >> diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile >> index 5540a16..c0b5b8c 100644 >> --- a/arch/arm64/lib/Makefile >> +++ b/arch/arm64/lib/Makefile >> @@ -3,12 +3,12 @@ lib-y := clear_user.o delay.o copy_from_user.o \ >> copy_to_user.o copy_in_user.o copy_page.o \ >> clear_page.o memchr.o memcpy.o memmove.o memset.o \ >> memcmp.o strcmp.o strncmp.o strlen.o strnlen.o \ >> - strchr.o strrchr.o tishift.o >> + strchr.o strrchr.o tishift.o checksum.o >> >> ifeq ($(CONFIG_KERNEL_MODE_NEON), y) >> -obj-$(CONFIG_XOR_BLOCKS) += xor-neon.o >> -CFLAGS_REMOVE_xor-neon.o += -mgeneral-regs-only >> -CFLAGS_xor-neon.o += -ffreestanding > Did you mean to delete these lines? It's my mistake. >> +obj-y += do_csum.o >> +CFLAGS_REMOVE_do_csum.o += -mgeneral-regs-only >> +CFLAGS_do_csum.o += -ffreestanding >> endif >> >> # Tell the compiler to treat all general purpose registers (with the >> diff --git a/arch/arm64/lib/checksum.c b/arch/arm64/lib/checksum.c >> new file mode 100644 >> index 0000000..15a31bb >> --- /dev/null >> +++ b/arch/arm64/lib/checksum.c >> @@ -0,0 +1,26 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * >> + * Authors: Lingyan Huang >> + * Copyright (C) 2018 Hisilicon, Inc. All Rights Reserved. >> + * >> + * Generic C or neon implementation of do_csum operations. >> + * Choose faster neon instructions when NEON is supported. >> + * >> + */ >> + >> +#include >> +#include >> +#include >> + >> +#define CSUM_NEON_THRESHOLD 128 >> + >> +unsigned int do_csum(const unsigned char *buff, int len) >> +{ >> +#ifdef CONFIG_KERNEL_MODE_NEON >> + if (len >= CSUM_NEON_THRESHOLD) >> + return do_csum_arm(buff, len); >> +#endif /* CONFIG_KERNEL_MODE_NEON */ >> +#undef do_csum >> + return 0; >> +} >> diff --git a/arch/arm64/lib/do_csum.c b/arch/arm64/lib/do_csum.c >> new file mode 100644 >> index 0000000..893583f >> --- /dev/null >> +++ b/arch/arm64/lib/do_csum.c >> @@ -0,0 +1,136 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Authors: Lingyan Huang >> + * Copyright (C) 2018 Hisilicon, Inc. All Rights Reserved. >> + * >> + * Generic C or neon implementation of do_csum operations. >> + * Choose faster neon instructions when NEON is supported. >> + * >> + */ >> + >> +#include >> +#include > Unless you call may_use_simd() (see below) I can't see why you need to > include . 
Yeah, may_use_simd() should be called.

>> +#include
>> +#include
>> +#include
>> +
>> +#define CSUM_NEON_THRESHOLD 128
>> +#ifdef CONFIG_KERNEL_MODE_NEON
>> +static inline u32 from64to16(u64 x)
>> +{
>> +	/* add up 32-bit and 32-bit for 32+c bit */
>> +	x = (x & 0xffffffff) + (x >> 32);
>> +	/* add up carry.. */
>> +	x = (x & 0xffffffff) + (x >> 32);
>> +	/* add up 16-bit and 16-bit for 16+c bit */
>> +	x = ((u32)x & 0xffff) + ((u32)x >> 16);
>> +	/* add up carry.. */
>> +	x = ((u32)x & 0xffff) + ((u32)x >> 16);
>> +	return x;
>> +}
>> +
>> +unsigned int do_csum_neon(const unsigned char *buff, int len)
>> +{
>> +	unsigned int odd, count;
>> +	uint64_t result = 0;
>> +	unsigned int count64;
>> +	uint32x4_t vzero = (uint32x4_t){0, 0, 0, 0};
>> +
>> +	register uint32x4_t v0, v1, v2, v3;
>
> Is "register" needed here?  Is there any impact on performance?
>
> Usually it's best to leave register allocation decisions up to the
> compiler.

OK

>
>> +
>> +	if (unlikely(len <= 0))
>> +		return result;
>> +
>> +	odd = 1 & (unsigned long)buff;
>> +	if (unlikely(odd)) {
>> +		result = *buff;
>
> The generic code has a shift here for the little-endian case.  Why don't
> we need that here?

I wrote this code with reference to the x86 version, which does not
distinguish endianness. Maybe little-endian should be handled differently
from big-endian; see the sketch further down in this mail.

>
>> +		len--;
>> +		buff++;
>> +	}
>> +
>> +	count = len >> 1;
>> +	if (count) {
>> +		if (2 & (unsigned long)buff) {
>> +			result += *(unsigned short *)buff;
>> +			count--;
>> +			len -= 2;
>> +			buff += 2;
>> +		}
>> +		count >>= 1;		/* nr of 32-bit words.. */
>> +		if (count) {
>> +			if (4 & (unsigned long)buff) {
>> +				result += *(unsigned int *)buff;
>> +				count--;
>> +				len -= 4;
>> +				buff += 4;
>> +			}
>> +			count >>= 1;	/* nr of 64-bit words.. */
>> +
>> +			v0 = vzero;
>> +			v1 = vzero;
>> +			v2 = vzero;
>> +			v3 = vzero;
>> +
>> +			count64 = count >> 3;	/* compute 64 Byte circle */
>> +			while (count64) {
>> +				v0 = vpadalq_u16(v0,
>> +					vld1q_u16((uint16_t *)buff + 0));
>
> Can this loop iterate more than 65536 times?  If it can, it looks like
> we can overflow.
>
> (I think the initial value of len would have to be > 0x400000 in order
> for this to happen.)

Yes, you are right. I didn't consider such long lengths, and I wonder
whether any real workload passes buffers that long. Besides, the x86
do_csum() in csum-partial_64.c does not consider that situation either.

>
>> +				v1 = vpadalq_u16(v1,
>> +					vld1q_u16((uint16_t *)buff + 8));
>> +				v2 = vpadalq_u16(v2,
>> +					vld1q_u16((uint16_t *)buff + 16));
>> +				v3 = vpadalq_u16(v3,
>> +					vld1q_u16((uint16_t *)buff + 24));
>> +				buff += 64;
>> +				count64--;
>> +			}
>> +			v0 = vaddq_u32(v0, v1);
>> +			v2 = vaddq_u32(v2, v3);
>> +			v0 = vaddq_u32(v0, v2);
>
> Can't we defer the folding down until later?  We could just accumulate
> the next 16 bytes' result into v0, and do the folding all in one go
> later on.

Yes, we can, since the 16-byte loop doesn't need v1-v3. Folding down now
or later seems to make no difference.
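(Regarding the little-endian question above: for reference, the generic
do_csum() in lib/checksum.c handles the head and tail bytes as below.
This is a minimal, untested sketch; the plain 16-bit bulk loop just
stands in for the NEON body, and from64to16() is the helper from the
patch.)

#include <linux/types.h>

static unsigned int csum_bytes_sketch(const unsigned char *buff, int len)
{
	u64 result = 0;
	unsigned int odd = 1 & (unsigned long)buff;

	if (odd) {
#ifdef __LITTLE_ENDIAN
		result += (*buff << 8);	/* odd byte is the high half of its word */
#else
		result = *buff;
#endif
		len--;
		buff++;
	}

	while (len >= 2) {		/* buff is now 2-byte aligned */
		result += *(const u16 *)buff;
		buff += 2;
		len -= 2;
	}

	if (len & 1)
#ifdef __LITTLE_ENDIAN
		result += *buff;	/* mirrored treatment for the tail byte */
#else
		result += (*buff << 8);
#endif

	result = from64to16(result);	/* fold carries, as in the patch */
	if (odd)			/* undo the head-byte displacement */
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
	return result;
}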
>
>> +
>> +			count %= 8;
>> +			while (count >= 2) {	/* compute 16 byte circle */
>> +				v0 = vpadalq_u16(v0,
>> +					vld1q_u16((uint16_t *)buff + 0));
>> +				buff += 16;
>> +				count -= 2;
>> +			}
>> +
>> +			result += vgetq_lane_u32(v0, 0);
>> +			result += vgetq_lane_u32(v0, 1);
>> +			result += vgetq_lane_u32(v0, 2);
>> +			result += vgetq_lane_u32(v0, 3);
>> +			if (count & 1) {
>> +				result += *(unsigned long long *)buff;
>> +				buff += 8;
>> +			}
>> +			if (len & 4) {
>> +				result += *(unsigned int *)buff;
>> +				buff += 4;
>> +			}
>> +		}
>> +		if (len & 2) {
>> +			result += *(unsigned short *)buff;
>> +			buff += 2;
>> +		}
>> +	}
>> +	if (len & 1)
>> +		result += *buff;
>
> What about the little-endian case?
>
>> +	result = from64to16(result);
>> +	if (odd)
>> +		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
>> +	return result;
>> +}
>> +#endif
>> +
>> +
>> +unsigned int do_csum_arm(const unsigned char *buff, int len)
>> +{
>> +	unsigned int res;
>> +
>> +	kernel_neon_begin();
>> +	res = do_csum_neon(buff, len);
>
> If len can be large, you should split into smaller blocks, with
> kernel_neon_end()/_begin() between to provide a chance to preempt if len
> is large.  (Splitting into smaller blocks may also help avoid overflow.)
>
> Ard may be able to suggest how often this should be done.

Good solution for large lengths; a sketch of the chunked approach is at
the end of this mail.

> Also, there is no guarantee that you can use NEON in softirq context,
> because the kernel may already have been using NEON when the softirq
> fired.  If this code may get called from softirq context, then you would
> need something along the lines of
>
> 	if (may_use_simd()) {
> 		kernel_neon_begin();
> 		/* NEON accelerated code */
> 		kernel_neon_end();
> 	} else {
> 		/* Fallback C code */
> 	}
>
> [...]

Yes, you are right.

>
> Cheers
> ---Dave
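P.S. For concreteness, here is a minimal, untested sketch of the chunked
approach described above. It assumes the patch's do_csum_neon() and
from64to16() are visible in the same file, plus a hypothetical scalar
fallback named do_csum_generic(). Bounding each NEON section both gives
the scheduler a chance to preempt and keeps the u16 lane accumulators
well away from the overflow point:

#include <linux/kernel.h>
#include <asm/neon.h>
#include <asm/simd.h>

/* Even block size: the 16-bit pairing of the ones'-complement sum is
 * preserved across block boundaries, so the folded partial sums of the
 * blocks simply add up.  64 KiB is far below the ~4 MB length at which
 * the u16 lanes could overflow inside one NEON section.
 */
#define CSUM_NEON_BLOCK	(64 * 1024)

unsigned int do_csum_arm(const unsigned char *buff, int len)
{
	u64 sum = 0;

	if (!may_use_simd())		/* e.g. NEON already in use when a softirq fired */
		return do_csum_generic(buff, len);	/* hypothetical scalar fallback */

	while (len > 0) {
		int n = min_t(int, len, CSUM_NEON_BLOCK);

		kernel_neon_begin();
		sum += do_csum_neon(buff, n);	/* folded 16-bit partial sum of this block */
		kernel_neon_end();		/* preemption is possible here */

		buff += n;
		len -= n;
	}

	return from64to16(sum);		/* fold the accumulated carries once */
}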