From: "huanglingyan (A)" <huanglingyan2@huawei.com>
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>,
Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Zhangshaokun <zhangshaokun@hisilicon.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will.deacon@arm.com>,
linux-arm-kernel <linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH v3] arm64: lib: accelerate do_csum with NEON instruction
Date: Mon, 18 Feb 2019 16:49:53 +0800 [thread overview]
Message-ID: <5531d4f2-c4cd-822e-3c0f-b3cba6dc8e91@huawei.com> (raw)
In-Reply-To: <f4b7b467-410c-9454-535e-f8a413297fa1@huawei.com>
On 2019/2/14 17:57, huanglingyan (A) wrote:
> On 2019/2/14 1:55, Ard Biesheuvel wrote:
>> (+ Ilias)
>>
>> On Wed, 13 Feb 2019 at 10:15, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>>> On Wed, 13 Feb 2019 at 09:42, huanglingyan (A) <huanglingyan2@huawei.com> wrote:
>>>> On 2019/2/12 15:07, Ard Biesheuvel wrote:
>>>>> On Tue, 12 Feb 2019 at 03:25, huanglingyan (A) <huanglingyan2@huawei.com> wrote:
>>>>>> On 2019/1/18 19:14, Ard Biesheuvel wrote:
>>>>>>> On Fri, 18 Jan 2019 at 02:07, huanglingyan (A) <huanglingyan2@huawei.com> wrote:
>>>>>>>> On 2019/1/17 0:46, Will Deacon wrote:
>>>>>>>>> On Wed, Jan 09, 2019 at 10:03:05AM +0800, huanglingyan (A) wrote:
>>>>>>>>>> On 2019/1/8 21:54, Will Deacon wrote:
>>>>>>>>>>> [re-adding Ard and LAKML -- not sure why the headers are so munged]
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 07, 2019 at 10:38:55AM +0800, huanglingyan (A) wrote:
>>>>>>>>>>>> On 2019/1/6 16:26, Ard Biesheuvel wrote:
>>>>>>>>>>>> Please change this into
>>>>>>>>>>>>
>>>>>>>>>>>> if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
>>>>>>>>>>>> len >= CSUM_NEON_THRESHOLD &&
>>>>>>>>>>>> may_use_simd()) {
>>>>>>>>>>>> kernel_neon_begin();
>>>>>>>>>>>> res = do_csum_neon(buff, len);
>>>>>>>>>>>> kernel_neon_end();
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> and drop the intermediate do_csum_arm()
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> + return do_csum_arm(buff, len);
>>>>>>>>>>>> +#endif /* CONFIG_KERNEL_MODE_NEON */
>>>>>>>>>>>>
>>>>>>>>>>>> No else? What happens if len < CSUM_NEON_THRESHOLD ?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> +#undef do_csum
>>>>>>>>>>>>
>>>>>>>>>>>> Can we drop this?
>>>>>>>>>>>>
>>>>>>>>>>>> Using NEON instructions will bring some costs. The spending maybe introduced
>>>>>>>>>>>> when reservering/restoring
>>>>>>>>>>>> neon registers with kernel_neon_begin()/kernel_neon_end(). Therefore NEON code
>>>>>>>>>>>> is Only used when
>>>>>>>>>>>> the length exceeds CSUM_NEON_THRESHOLD. General do csum() codes in lib/
>>>>>>>>>>>> checksum.c will be used in
>>>>>>>>>>>> shorter length. To achieve this goal, I use the "#undef do_csum" in else clause
>>>>>>>>>>>> to have the oppotunity to
>>>>>>>>>>>> utilize the general codes.
>>>>>>>>>>> I don't think that's how it works :/
>>>>>>>>>>>
>>>>>>>>>>> Before we get deeper into the implementation, please could you justify the
>>>>>>>>>>> need for a CPU-optimised checksum implementation at all? I thought this was
>>>>>>>>>>> usually offloaded to the NIC?
>>>>>>>>>>>
>>>>>>>>>>> Will
>>>>>>>>>>>
>>>>>>>>>>> .
>>>>>>>>>> This problem is introduced when testing Intel x710 network card on my ARM server.
>>>>>>>>>> Ip forward is set for ease of testing. Then send lots of packages to server by Tesgine
>>>>>>>>>> machine and then receive.
>>>>>>>>> In the marketing blurb, that card boasts:
>>>>>>>>>
>>>>>>>>> `Tx/Rx IP, SCTP, TCP, and UDP checksum offloading (IPv4, IPv6) capabilities'
>>>>>>>>>
>>>>>>>>> so we shouldn't need to run this on the CPU. Again, I'm not keen to optimise
>>>>>>>>> this given that it /really/ shouldn't be used on arm64 machines that care
>>>>>>>>> about network performance.
>>>>>>>>>
>>>>>>>>> Will
>>>>>>>>>
>>>>>>>>> .
>>>>>>>> Yeah, you are right. Checksum is usually done in network card which is told by
>>>>>>>> someone familiar with NIC. However, it may be used in testing scenaries and
>>>>>>>> some primary network cards. I think it's no harm to optimize this code while
>>>>>>>> other ARCHs have their own optimized versions.
>>>>>>> I disagree. If this code path is never exercised, we should not
>>>>>>> include it. We can revisit this decision when there is a use case
>>>>>>> where the checksumming performance is an actual bottleneck.
>>>>>>>
>>>>>>> .
>>>>>> The mainstream network cards has an option to switch the csum pattern.
>>>>>> Users can determine the one who calculate csum, hardware or software.
>>>>>>
>>>>>> ethtool -K eth0 rx-checksum off
>>>>>> ethtool -K eth0 tx-checksum-ip-generic off
>>>>>>
>>>>>> What's more, there's some network features that may cause hardware
>>>>>> checksum not work, like gso ( not so sure). Which means, the software
>>>>>> checksum has its existing meaning.
>>>>>>
>>>>> This does not make any sense to me. Segmentation offload relies on the
>>>>> hardware generating the actual packets, and I don't see how it would
>>>>> be able to do that if it cannot generate the checksum as well.
>>>> I test on my platform of IP-forward scenery. The network card has checksum capability.
>>>> The hardware do checksum when gro feature is off. However, checksum is done by
>>>> software when gro is on. In this sceney, do_csum function has 60% percentage of CPU load
>>>> and the performance decreases 20% due to software checksum.
>>>>
>>>> The command I use is
>>>> ethtool -K eth0 gro off
>>>>
>>> But this is about IP forwarding, right? So GRO is enabled, which means
>>> the packets are combined at the rx side. So does this mean the kernel
>>> always recalculates the checksum in software in this case? Or only for
>>> forwarded packets, where I would expect the outgoing interface to
>>> recalculate the checksum if TX checksum offload is enabled.
>> OK, after digging into this a bit more (with the help of Ilias -
>> thanks!), I agree that there may be cases where we still rely on
>> software IP checksumming even when using offload capable hardware. So
>> I also agree that it makes sense to provide an optimized
>> implementation for arm64.
>>
>> However, I am not yet convinced that a SIMD implementation is worth
>> the hassle. I did some background reading [0] and came up with a
>> scalar arm64 assembler implementation [1] that is almost as fast on
>> Cortex-A57, and so I would like to get a feeling for how it performs
>> on other micro-architectures. (Do note that the code has not been
>> tested on big endian yet.)
>>
>> Lingyan, could you please compare the scalar performance with the NEON
>> performance on your CPU? Thanks.
> OK, I'll test it on my CPU. The experimental platform should be built again.
> I will inform you as soon as I get the results.
Below is the results tested on my platform. The performance of your patch is really nice.
The 2nd colomn is general do_csum now in Linux. The 3rd is your patch. The 4th is
neon realization. Last is neon realization without kernel_neon_begin/kernel_neon_end.
1000cycle general(ns) csum_ard(ns) csum_neon(ns) csum_neon_no_kerbegin(ns)
64B: 75690 40890 76710 57440
256B: 171740 54050 109640 63730
1023B: 553220 105930 155630 93520
1024B: 554680 103500 148610 86890
1500B: 793810 134540 164510 104590
2048B: 1070880 167800 178700 119570
4095B: 2091000 299140 249580 189740
4096B: 2091610 296760 244310 183130
The reason should be analyzed that data width of NEON instruction is twice than the
general registers while performance is not. The kernel_neon_begin/end() seems to cost
a lot. Other reasons may include complex code implementations due to lack of experience.
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
> .
>
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next prev parent reply other threads:[~2019-02-18 8:49 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-01-06 1:55 [PATCH v3] arm64: lib: accelerate do_csum with NEON instruction Lingyan Huang
2019-01-06 8:26 ` Ard Biesheuvel
[not found] ` <9129b882-60f3-8046-0cb9-e0b2452a118d@huawei.com>
2019-01-08 13:54 ` Will Deacon
2019-01-09 2:03 ` huanglingyan (A)
2019-01-10 4:08 ` 胡海
2019-01-10 8:14 ` huanglingyan (A)
2019-01-16 16:46 ` Will Deacon
2019-01-18 1:07 ` huanglingyan (A)
2019-01-18 11:14 ` Ard Biesheuvel
2019-02-12 2:26 ` huanglingyan (A)
2019-02-12 7:07 ` Ard Biesheuvel
2019-02-13 8:42 ` huanglingyan (A)
2019-02-13 9:15 ` Ard Biesheuvel
2019-02-13 17:55 ` Ard Biesheuvel
2019-02-14 9:57 ` huanglingyan (A)
2019-02-18 8:49 ` huanglingyan (A) [this message]
2019-02-18 9:03 ` Ard Biesheuvel
2019-01-09 14:58 ` Dave Martin
2019-01-10 8:03 ` huanglingyan (A)
2019-01-10 13:53 ` Dave Martin
[not found] <1f065749-6676-6489-14ae-fdcfeeb3389c@huawei.com>
2019-01-07 6:11 ` huanglingyan (A)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5531d4f2-c4cd-822e-3c0f-b3cba6dc8e91@huawei.com \
--to=huanglingyan2@huawei.com \
--cc=ard.biesheuvel@linaro.org \
--cc=catalin.marinas@arm.com \
--cc=ilias.apalodimas@linaro.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=will.deacon@arm.com \
--cc=zhangshaokun@hisilicon.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).