Re: [PATCH v3] arm64: lib: accelerate do_csum with NEON instruction

From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: "huanglingyan (A)" <huanglingyan2@huawei.com>,
	 Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Zhangshaokun <zhangshaokun@hisilicon.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will.deacon@arm.com>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH v3] arm64: lib: accelerate do_csum with NEON instruction
Date: Wed, 13 Feb 2019 18:55:14 +0100
Message-ID: <CAKv+Gu-NuJCkeWqdK_pi4Vm7tyM4X=ZKqWdYg-m=Go2O6_fUrQ@mail.gmail.com>
In-Reply-To: <CAKv+Gu9XDneyLwidA+fGkpgJOb0owegYHNPJ2iLfqAovZox9GQ@mail.gmail.com>

(+ Ilias)

On Wed, 13 Feb 2019 at 10:15, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
> On Wed, 13 Feb 2019 at 09:42, huanglingyan (A) <huanglingyan2@huawei.com> wrote:
> >
> >
> > On 2019/2/12 15:07, Ard Biesheuvel wrote:
> > > On Tue, 12 Feb 2019 at 03:25, huanglingyan (A) <huanglingyan2@huawei.com> wrote:
> > >>
> > >> On 2019/1/18 19:14, Ard Biesheuvel wrote:
> > >>> On Fri, 18 Jan 2019 at 02:07, huanglingyan (A) <huanglingyan2@huawei.com> wrote:
> > >>>> On 2019/1/17 0:46, Will Deacon wrote:
> > >>>>> On Wed, Jan 09, 2019 at 10:03:05AM +0800, huanglingyan (A) wrote:
> > >>>>>> On 2019/1/8 21:54, Will Deacon wrote:
> > >>>>>>> [re-adding Ard and LAKML -- not sure why the headers are so munged]
> > >>>>>>>
> > >>>>>>> On Mon, Jan 07, 2019 at 10:38:55AM +0800, huanglingyan (A) wrote:
> > >>>>>>>> On 2019/1/6 16:26, Ard Biesheuvel wrote:
> > >>>>>>>>     Please change this into
> > >>>>>>>>
> > >>>>>>>>     if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
> > >>>>>>>>         len >= CSUM_NEON_THRESHOLD &&
> > >>>>>>>>         may_use_simd()) {
> > >>>>>>>>             kernel_neon_begin();
> > >>>>>>>>             res = do_csum_neon(buff, len);
> > >>>>>>>>             kernel_neon_end();
> > >>>>>>>>         }
> > >>>>>>>>
> > >>>>>>>>     and drop the intermediate do_csum_arm()
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>         +               return do_csum_arm(buff, len);
> > >>>>>>>>         +#endif  /* CONFIG_KERNEL_MODE_NEON */
> > >>>>>>>>
> > >>>>>>>>     No else? What happens if len < CSUM_NEON_THRESHOLD ?
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>         +#undef do_csum
> > >>>>>>>>
> > >>>>>>>>     Can we drop this?
> > >>>>>>>>
> > >>>>>>>> Using NEON instructions has a cost: the NEON registers must be
> > >>>>>>>> preserved and restored by kernel_neon_begin()/kernel_neon_end().
> > >>>>>>>> Therefore the NEON code is only used when the length exceeds
> > >>>>>>>> CSUM_NEON_THRESHOLD, and the generic do_csum() code in
> > >>>>>>>> lib/checksum.c is used for shorter lengths. To achieve this, I use
> > >>>>>>>> "#undef do_csum" in the else clause so that the generic code can
> > >>>>>>>> be used.
> > >>>>>>> I don't think that's how it works :/
> > >>>>>>>
> > >>>>>>> Before we get deeper into the implementation, please could you justify the
> > >>>>>>> need for a CPU-optimised checksum implementation at all? I thought this was
> > >>>>>>> usually offloaded to the NIC?
> > >>>>>>>
> > >>>>>>> Will
> > >>>>>>>
> > >>>>>>> .
> > >>>>>> This problem came up when testing an Intel X710 network card on my ARM server.
> > >>>>>> IP forwarding is enabled for ease of testing. A Tesgine machine then sends lots
> > >>>>>> of packets to the server and receives them back.
> > >>>>> In the marketing blurb, that card boasts:
> > >>>>>
> > >>>>>   `Tx/Rx IP, SCTP, TCP, and UDP checksum offloading (IPv4, IPv6) capabilities'
> > >>>>>
> > >>>>> so we shouldn't need to run this on the CPU. Again, I'm not keen to optimise
> > >>>>> this given that it /really/ shouldn't be used on arm64 machines that care
> > >>>>> about network performance.
> > >>>>>
> > >>>>> Will
> > >>>>>
> > >>>>> .
> > >>>> Yeah, you are right. Checksumming is usually done by the network card, as
> > >>>> someone familiar with NICs told me. However, it may still be done in software
> > >>>> in testing scenarios and with some basic network cards. I think there is no
> > >>>> harm in optimizing this code, given that other architectures have their own
> > >>>> optimized versions.
> > >>> I disagree. If this code path is never exercised, we should not
> > >>> include it. We can revisit this decision when there is a use case
> > >>> where the checksumming performance is an actual bottleneck.
> > >>>
> > >>> .
> > >> Mainstream network cards have an option to switch the checksum mode.
> > >> Users can choose whether hardware or software calculates the checksum.
> > >>
> > >>         ethtool -K eth0 rx-checksum off
> > >>         ethtool -K eth0 tx-checksum-ip-generic off
> > >>
> > >> What's more, there are some network features that may prevent hardware
> > >> checksumming from working, like GSO (not so sure). This means software
> > >> checksumming still has its uses.
> > >>
> > > This does not make any sense to me. Segmentation offload relies on the
> > > hardware generating the actual packets, and I don't see how it would
> > > be able to do that if it cannot generate the checksum as well.
> > I tested an IP forwarding scenario on my platform. The network card has checksum
> > capability. The hardware does the checksum when the GRO feature is off. However, the
> > checksum is done in software when GRO is on. In this scenario, do_csum accounts for
> > 60% of the CPU load, and performance drops by 20% due to software checksumming.
> >
> > The command I use is
> >         ethtool -K eth0 gro off
> >
>
> But this is about IP forwarding, right? So GRO is enabled, which means
> the packets are combined at the rx side. So does this mean the kernel
> always recalculates the checksum in software in this case? Or only for
> forwarded packets, where I would expect the outgoing interface to
> recalculate the checksum if TX checksum offload is enabled.

OK, after digging into this a bit more (with the help of Ilias -
thanks!), I agree that there may be cases where we still rely on
software IP checksumming even when using offload capable hardware. So
I also agree that it makes sense to provide an optimized
implementation for arm64.

However, I am not yet convinced that a SIMD implementation is worth
the hassle. I did some background reading [0] and came up with a
scalar arm64 assembler implementation [1] that is almost as fast on
Cortex-A57, and so I would like to get a feeling for how it performs
on other micro-architectures. (Do note that the code has not been
tested on big endian yet.)
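
For readers who want to see the shape of the scalar approach, here is a
minimal portable C sketch of the general idea discussed in [0]: accumulate
wide words into a 64-bit sum and fold the carries back in at the end. This is
not the arm64 assembler code in [1]. The names csum_fold64() and
csum_scalar_sketch() are hypothetical and exist only for this illustration.
The sketch assembles words from individual bytes in network order, so the
result does not depend on host endianness, and it assumes buffers well under
16 GiB so the 64-bit accumulator cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 64-bit ones' complement accumulator down to 16 bits. */
static uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* at most 33 bits */
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* fits in 32 bits */
	sum = (sum & 0xffff) + (sum >> 16);		/* at most 17 bits */
	sum = (sum & 0xffff) + (sum >> 16);		/* fits in 16 bits */
	return (uint16_t)sum;
}

/*
 * Hypothetical helper: ones' complement sum of buf[0..len), treating the
 * data as big-endian 16-bit words. The result is NOT inverted.
 */
static uint16_t csum_scalar_sketch(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i = 0;

	/* Main loop: add one 32-bit big-endian word per iteration. */
	for (; i + 4 <= len; i += 4)
		sum += ((uint32_t)buf[i] << 24) |
		       ((uint32_t)buf[i + 1] << 16) |
		       ((uint32_t)buf[i + 2] << 8) |
		       (uint32_t)buf[i + 3];

	/* Tail: 1 to 3 leftover bytes, zero padded on the right. */
	if (i < len) {
		uint32_t tail = 0;
		int shift = 24;

		for (; i < len; i++, shift -= 8)
			tail |= (uint32_t)buf[i] << shift;
		sum += tail;
	}

	return csum_fold64(sum);
}
```

Adding 32-bit words and folding afterwards gives the same result as adding
16-bit words one at a time, because 2^16 is congruent to 1 modulo 0xffff. For
the eight-byte example in RFC 1071 (00 01 f2 03 f4 f5 f6 f7) the sketch
returns 0xddf2.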

Lingyan, could you please compare the scalar performance with the NEON
performance on your CPU? Thanks.

-- 
Ard.

[0] https://locklessinc.com/articles/tcp_checksum/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=arm64-csum