From: "Stanisław Kardach" <kda@semihalf.com>
To: "Morten Brørup" <mb@smartsharesystems.com>
Cc: "Mattias Rönnblom" <hofors@lysator.liu.se>,
	"Emil Berg" <emil.berg@ericsson.com>,
	"Bruce Richardson" <bruce.richardson@intel.com>,
	dev <dev@dpdk.org>,
	"Stephen Hemminger" <stephen@networkplumber.org>,
	"dpdk stable" <stable@dpdk.org>,
	bugzilla@dpdk.org, "Olivier Matz" <olivier.matz@6wind.com>
Subject: Re: [PATCH v4] net: fix checksum with unaligned buffer
Date: Thu, 7 Jul 2022 17:21:17 +0200
Message-ID: <CALVGJWKAdrNi2u6m-v5WPSWTCMej37qnN2gC=c5AVO-8od4cUA@mail.gmail.com>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D8719B@smartserver.smartshare.dk>

On Thu, Jun 30, 2022 at 6:32 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > Sent: Tuesday, 28 June 2022 08.28
> >
> > On 2022-06-27 22:21, Morten Brørup wrote:
> > >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > >> Sent: Monday, 27 June 2022 19.23
> > >>
> > >> On 2022-06-27 15:22, Morten Brørup wrote:
> > >>>> From: Emil Berg [mailto:emil.berg@ericsson.com]
> > >>>> Sent: Monday, 27 June 2022 14.51
> > >>>>
> > >>>>> From: Emil Berg
> > >>>>> Sent: den 27 juni 2022 14:46
> > >>>>>
> > >>>>>> From: Mattias Rönnblom <hofors@lysator.liu.se>
> > >>>>>> Sent: den 27 juni 2022 14:28
> > >>>>>>
> > >>>>>> On 2022-06-23 14:51, Morten Brørup wrote:
> > >>>>>>>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > >>>>>>>> Sent: Thursday, 23 June 2022 14.39
> > >>>>>>>>
> > >>>>>>>> With this patch, the checksum can be calculated on an unaligned
> > >>>>>>>> buffer. I.e. the buf parameter is no longer required to be 16 bit
> > >>>>>>>> aligned.
> > >>>>>>>>
> > >>>>>>>> The checksum is still calculated using a 16 bit aligned pointer, so
> > >>>>>>>> the compiler can auto-vectorize the function's inner loop.
> > >>>>>>>>
> > >>>>>>>> When the buffer is unaligned, the first byte of the buffer is
> > >>>>>>>> handled separately. Furthermore, the calculated checksum of the
> > >>>>>>>> buffer is byte shifted before being added to the initial checksum,
> > >>>>>>>> to compensate for the checksum having been calculated on the buffer
> > >>>>>>>> shifted by one byte.
> > >>>>>>>>
> > >>>>>>>> v4:
> > >>>>>>>> * Add copyright notice.
> > >>>>>>>> * Include stdbool.h (Emil Berg).
> > >>>>>>>> * Use RTE_PTR_ADD (Emil Berg).
> > >>>>>>>> * Fix one more typo in commit message. Is 'unligned' even a word?
> > >>>>>>>> v3:
> > >>>>>>>> * Remove braces from single statement block.
> > >>>>>>>> * Fix typo in commit message.
> > >>>>>>>> v2:
> > >>>>>>>> * Do not assume that the buffer is part of an aligned packet buffer.
> > >>>>>>>>
> > >>>>>>>> Bugzilla ID: 1035
> > >>>>>>>> Cc: stable@dpdk.org
> > >>>>>>>>
> > >>>>>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > >
> > > [...]
> > >
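
To make the odd-start handling from the quoted commit message concrete,
here is a minimal sketch of the idea. It is not the patch itself. The names
u16_alias, fold16 and raw_cksum_any_align are hypothetical, the
__may_alias__ attribute is a GCC/Clang extension, and the buffer is assumed
to be short enough (well under 128 KiB) that the 32-bit accumulator cannot
overflow. It relies on the RFC 1071 property that byte-swapping a one's
complement sum equals the sum of the byte-swapped words.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: a 16-bit type that is allowed to alias other types. */
typedef uint16_t __attribute__((__may_alias__)) u16_alias;

/* Fold a 32-bit accumulator to 16 bits with end-around carry. */
static uint16_t
fold16(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Hypothetical helper: host-byte-order 16-bit sum of a buffer that may
 * start on an odd address. The inner loop only dereferences 2-byte
 * aligned addresses.
 */
static uint32_t
raw_cksum_any_align(const void *buf, size_t len, uint32_t sum)
{
	const uint8_t *p = buf;
	const bool odd = len > 0 && ((uintptr_t)p & 1) != 0;
	const u16_alias *w;
	uint32_t bsum = 0;
	uint16_t tmp;

	if (odd) {
		/* Treat the first byte as the second byte of an aligned word. */
		tmp = 0;
		((uint8_t *)&tmp)[1] = *p++;
		bsum += tmp;
		len--;
	}

	/* p is 2-byte aligned here, so plain 16-bit loads are safe. */
	w = (const u16_alias *)p;
	for (; len >= 2; len -= 2)
		bsum += *w++;

	if (len == 1) {
		/* A trailing byte is the first byte of a zero-padded word. */
		tmp = 0;
		((uint8_t *)&tmp)[0] = *(const uint8_t *)w;
		bsum += tmp;
	}

	if (odd) {
		/* The sum was built one byte out of phase: swap it back. */
		uint16_t f = fold16(bsum);

		bsum = (uint16_t)((f << 8) | (f >> 8));
	}
	return sum + bsum;
}
```

Calling fold16(raw_cksum_any_align(buf, len, 0)) on the same bytes placed
at an even and at an odd address should give the same 16-bit value. The
extra branches for the odd case are where this variant pays its cost.
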
> > >>>>>>
> > >>>>>> The compiler will be able to auto vectorize even unaligned accesses,
> > >>>>>> just with different instructions. From what I can tell, there's no
> > >>>>>> performance impact, at least not on the x86_64 systems I tried on.
> > >>>>>>
> > >>>>>> I think you should remove the first special case conditional and use
> > >>>>>> memcpy() instead of the cumbersome __may_alias__ construct to retrieve
> > >>>>>> the data.
> > >>>>>>
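
For reference, the memcpy() approach suggested above could look roughly
like the sketch below. This is an illustration under assumptions, not the
DPDK implementation: raw_cksum_memcpy and cksum_reduce are hypothetical
names, the sum is kept in host byte order, and the buffer is assumed to be
short enough (well under 128 KiB) that the 32-bit accumulator cannot
overflow. Compilers normally turn the fixed-size memcpy() into a plain (or
unaligned) load, so no special case for odd addresses is needed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: sum the buffer as 16-bit words in host byte order.
 * Every word is fetched with memcpy(), so buf may have any alignment.
 */
static uint32_t
raw_cksum_memcpy(const void *buf, size_t len, uint32_t sum)
{
	const uint8_t *p = buf;
	const uint8_t *end = p + (len & ~(size_t)1);

	for (; p != end; p += sizeof(uint16_t)) {
		uint16_t v;

		memcpy(&v, p, sizeof(v));
		sum += v;
	}

	if (len & 1) {
		/* A trailing byte is the first byte of a zero-padded word. */
		uint16_t left = 0;

		memcpy(&left, end, 1);
		sum += left;
	}
	return sum;
}

/* Fold the 32-bit accumulator to 16 bits with end-around carry. */
static uint16_t
cksum_reduce(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

The same bytes give the same result whether they sit at an even or an odd
address, and there is no extra branch on the alignment of buf.
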
> > >>>>>
> > >>>>> Here:
> > >>>>> https://www.agner.org/optimize/instruction_tables.pdf
> > >>>>> it lists the latency of vmovdqa (aligned) as 6 cycles and the latency
> > >>>>> for vmovdqu (unaligned) as 7 cycles. So I guess there can be some
> > >>>>> difference. Although in practice I'm not sure what difference it
> > >>>>> makes. I've not seen any difference in runtime between the two
> > >>>>> versions.
> > >>>>>
> > >>>>
> > >>>> Correction to my comment:
> > >>>> Those stats are for some older CPU. For some newer CPUs such as Tiger
> > >>>> Lake the stats seem to be the same regardless of aligned or unaligned.
> > >>>>
> > >>>
> > >>> I agree that the memcpy method is more elegant and easy to read.
> > >>>
> > >>> However, we would need to performance test the modified checksum
> > >>> function with a large number of CPUs to prove that we don't introduce
> > >>> a performance regression on any CPU architecture still supported by
> > >>> DPDK. And Emil already found a CPU where it costs 1 extra cycle per 16
> > >>> bytes, which adds up to a total of ca. 91 extra cycles on a 1460 byte
> > >>> TCP packet.
> > >>>
> > >>
> > >> I think you've misunderstood what latency means in such tables. It's a
> > >> data dependency thing, not a measure of throughput. The throughput is
> > >> *much* higher. My guess would be two such instructions per clock.
> > >>
> > >> For your 1460-byte example, my Zen3 AMD performs identically with
> > >> the current DPDK implementation, your patch, and a memcpy()-ified
> > >> version of the current implementation. They all need ~130 clock
> > >> cycles/packet, with warm caches. IPC is 3 instructions per cycle,
> > >> but obviously not all instructions are SIMD.
> > >
> > > You're right, I wasn't thinking deeper about it before extrapolating.
> > >
> > > Great to see some real numbers! I wish someone would do the same
> > > testing on an old ARM CPU, so we could also see the other end of the
> > > scale.
> > >
> >
> > I've run it on an ARM A72. For the aligned 1460 bytes case I got:
> > Current DPDK ~572 cc. Your patch: ~578 cc. Memcpy-fied: ~573 cc. They
> > performed about the same for all unaligned/aligned and sizes I tested.
> > This platform (or could be GCC version as well) doesn't suffer from the
> > unaligned performance degradation your patch showed on my AMD machine.
> >
> > >> The main issue with checksumming on the CPU is, in my experience, not
> > >> that you don't have enough compute, but that you trash the caches.
> > >
> > > Agree. I have noticed that x86 has "non-temporal" instruction variants
> > > to load/store data without trashing the cache entirely.
> > >
> > > A variant of the checksum function using such instructions might be
> > > handy.
> > >
> >
> > Yes, although you may need to prefetch the payload for good
> > performance.
> >
> > > Variants of the memcpy function using such instructions might also be
> > > handy for some purposes, e.g. copying the contents of packets, where
> > > the original and/or copy will not be accessed shortly thereafter.
> > >
> >
> > Indeed, and I think it's been discussed on the list. There's some work
> > to get it right, since the alignment requirements and the fact that a
> > different memory model is used for those SIMD instructions cause trouble
> > for a generic implementation. (For x86_64.)
>
> I just posted an RFC [1] for such memcpy() and memset() functions,
> so let's see how it pans out.
>
> [1] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87195@smartserver.smartshare.dk/T/#u
>
> >
> > >>> So I opted for a solution with zero changes to the inner loop, so no
> > >>> performance retesting is required (for the previously supported use
> > >>> cases, where the buffer is aligned).
> > >>>
> > >>
> > >> You will see performance degradation with this solution as well, under
> > >> certain conditions. For unaligned 100 bytes of data, the current DPDK
> > >> implementation and the memcpy()-fied version need ~21 cc/packet. Your
> > >> patch needs 54 cc/packet.
> > >
> > > Yes, it's a tradeoff. I exclusively aimed at maintaining performance
> > > for the case with aligned buffers (under all circumstances, with all
> > > CPUs etc.), and ignored how it affects the performance for the case
> > > with unaligned buffers.
> > >
> > > Unlike this patch, the memcpy() variant has no additional branches
> > > for the unaligned case, so its performance should be generally
> > > unaffected by the buffer being aligned or not. However, I don't have
> > > sufficient in-depth CPU knowledge to say if this also applies to RISCV
> > > and older ARM CPUs still supported by DPDK.
> > >
> >
> > I don't think avoiding RISCV non-catastrophic regressions trumps
> > improving performance on mainstream CPUs and avoiding code quality
> > regressions.
> +1
+1. In general, the RISC-V spec leaves unaligned load/store handling to
the implementation (an access might fault, or it might not). The U74 core
that I have at hand allows unaligned reads/writes. It's not a suitable
platform for performance evaluation though (time measurement causes a trap
to firmware), so I won't comment on performance.


--
Best Regards,
Stanisław Kardach
