All of lore.kernel.org
 help / color / mirror / Atom feed
From: Noah Goldstein <goldstein.w.n@gmail.com>
To: Eric Dumazet <edumazet@google.com>
Cc: tglx@linutronix.de, mingo@redhat.com,
	Borislav Petkov <bp@alien8.de>,
	dave.hansen@linux.intel.com, X86 ML <x86@kernel.org>,
	hpa@zytor.com, peterz@infradead.org, alexanderduyck@fb.com,
	open list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
Date: Thu, 25 Nov 2021 20:18:49 -0600	[thread overview]
Message-ID: <CAFUsyfLKqonuKAh4k2qdBa24H1wQtR5FkAmmtXQGBpyizi6xvQ@mail.gmail.com> (raw)
In-Reply-To: <CAFUsyfK-znRWJN7FTMdJaDTd45DgtBQ9ckKGyh8qYqn0eFMMFA@mail.gmail.com>

On Thu, Nov 25, 2021 at 8:15 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > Modify the 8x loop to that it uses two independent
> > > accumulators. Despite adding more instructions the latency and
> > > throughput of the loop is improved because the `adc` chains can now
> > > take advantage of multiple execution units.
> >
> > Nice !
> >
> > Note that I get better results if I do a different split, because the
> > second chain gets shorter.
> >
> > First chain adds 5*8 bytes from the buffer, but first bytes are a mere
> > load, so that is really 4+1 additions.
> >
> > Second chain adds 3*8 bytes from the buffer, plus the result coming
> > from the first chain, also 4+1 additions.
>
> Good call. With that approach I also see an improvement for the 32 byte
> case (which is on the hot path) so this change might actually matter :)
>
> >
> > asm("movq 0*8(%[src]),%[res_tmp]\n\t"
> >     "addq 1*8(%[src]),%[res_tmp]\n\t"
> >     "adcq 2*8(%[src]),%[res_tmp]\n\t"
> >     "adcq 3*8(%[src]),%[res_tmp]\n\t"
> >     "adcq 4*8(%[src]),%[res_tmp]\n\t"
> >     "adcq $0,%[res_tmp]\n\t"
> >     "addq 5*8(%[src]),%[res]\n\t"
> >     "adcq 6*8(%[src]),%[res]\n\t"
> >     "adcq 7*8(%[src]),%[res]\n\t"
> >     "adcq %[res_tmp],%[res]\n\t"
> >     "adcq $0,%[res]"
> >     : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum)
> >     : [src] "r" (buff)
> >     : "memory");
> >
> >
> > >
> > > Make the memory clobbers more precise. 'buff' is read only and we know
> > > the exact usage range. There is no reason to write-clobber all memory.
> >
> > Not sure if that matters in this function ? Or do we expect it being inlined ?
>
> It may matter for LTO build. I also think it can matter for the loop
> case. I didn't see
> any difference when playing around with the function in userland with:
>
> ```
> gcc -O3 -march=native -mtune=native checksum.c -o checksum
> ```
>
> but IIRC if the clobber is loops with inline assembly payloads can be
> de-optimized if GCC can't prove the iterations don't affect each other.
>
>
> >
> > Personally, I find the "memory" constraint to be more readable than these casts
> > "m"(*(const char(*)[64])buff));
> >
>
> Hmm, I personally find it more readable if I can tell what memory
> transforms happen
> just from reading the clobbers, but you're the maintainer.
>
> Do you want it changed in V2?
>
> > >
> > > Relative performance changes on Tigerlake:
> > >
> > > Time Unit: Ref Cycles
> > > Size Unit: Bytes
> > >
> > > size,   lat old,    lat new,    tput old,   tput new
> > >    0,     4.972,      5.054,       4.864,      4.870
> >
> > Really what matters in modern networking is the case for 40 bytes, and
> > eventually 8 bytes.
> >
> > Can you add these two cases in this nice table ?
> >
>
> Sure, with your suggestion in the 32 byte cases there is an improvement there
> too.
>
> > We hardly have to checksum anything with NIC that are not decades old.
> >
> > Apparently making the 64byte loop slightly longer incentives  gcc to
> > move it away (our intent with the unlikely() hint).

Do you think the 40/48 byte case might be better of in GAS assembly. It's
a bit difficult to get proper control flow optimization with GCC +
inline assembly
even with likely/unlikely (i.e expanding the 64 byte case moves it off hotpath,
cmov + ror instead of fallthrough for hotpath).
> >
> > Anyway I am thinking of providing a specialized inline version for
> > IPv6 header checksums (40 + x*8 bytes, x being 0  pretty much all the
> > time),
> > so we will likely not use csum_partial() anymore.
>
> I see. For now is it worth adding a case for 40 in this implementation?
>
>     if(likely(len == 40)) {
>         // optimized 40 + buff aligned case
>     }
>     else {
>         // existing code
>     }
>
>
> >
> > Thanks !

  reply	other threads:[~2021-11-26  2:21 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-25 19:38 [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c Noah Goldstein
2021-11-26  1:50 ` Eric Dumazet
2021-11-26  2:15   ` Noah Goldstein
2021-11-26  2:18     ` Noah Goldstein [this message]
2021-11-26  2:38       ` Noah Goldstein
2021-11-28 19:47         ` David Laight
2021-11-28 20:59           ` Noah Goldstein
2021-11-28 22:41             ` David Laight
2021-12-02 14:24             ` David Laight
2021-12-02 15:01               ` Eric Dumazet
2021-12-02 20:19                 ` Noah Goldstein
2021-12-02 21:11                   ` David Laight
2021-11-26 16:08     ` Eric Dumazet
2021-11-26 18:17       ` Noah Goldstein
2021-11-26 18:27         ` Eric Dumazet
2021-11-26 18:50           ` Noah Goldstein
2021-11-26 19:14             ` Noah Goldstein
2021-11-26 19:21               ` Eric Dumazet
2021-11-26 19:50                 ` Noah Goldstein
2021-11-26 20:07                   ` Eric Dumazet
2021-11-26 20:33                     ` Noah Goldstein
2021-11-27  0:15                       ` Eric Dumazet
2021-11-27  0:39                         ` Noah Goldstein
2021-11-26 18:17       ` Eric Dumazet
2021-11-27  4:25 ` [PATCH v2] " Noah Goldstein
2021-11-27  6:03   ` Eric Dumazet
2021-11-27  6:38     ` Noah Goldstein
2021-11-27  6:39 ` [PATCH v3] " Noah Goldstein
2021-11-27  6:51   ` Eric Dumazet
2021-11-27  7:18     ` Noah Goldstein

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAFUsyfLKqonuKAh4k2qdBa24H1wQtR5FkAmmtXQGBpyizi6xvQ@mail.gmail.com \
    --to=goldstein.w.n@gmail.com \
    --cc=alexanderduyck@fb.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=edumazet@google.com \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.