linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: sunrui26@huawei.com
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will.deacon@arm.com>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: Re: [PATCH] arm64: crc: accelerated-crc32-by-64bytes
Date: Sat, 24 Nov 2018 10:56:41 +0100	[thread overview]
Message-ID: <CAKv+Gu-ffSgS-ZH58u5mNYnqR0++osS-PHn8oHW72o2kRfDYCw@mail.gmail.com> (raw)
In-Reply-To: <0DF8AA8C9CD0FB4ABF4E83674055F7E2D82974@dggemi508-mbx.china.huawei.com>

On Sat, 24 Nov 2018 at 07:42, sunrui <sunrui26@huawei.com> wrote:
>
>
> On Thu, 22 Nov 2018 at 02:50, sunrui <sunrui26@huawei.com> wrote:
> >
> >
> >
> > On Sun, 18 Nov 2018 at 23:30, Rui Sun <sunrui26@huawei.com> wrote:
> >
> > >
> >
> > > add 64 bytes loop to acceleration calculation
> >
> > >
> >
> >
> >
> > Can you share some performance numbers please?
> >
> >
> >
> > Also, we don't need 64 byte, 32 byte and 16 byte code paths: just make the 8 byte one a loop as well, and drop the 32 byte and 16 byte ones.
> >
> >
> >
> > --
> >
> >
> >
> > Consider of some processor has instruction N-way parallel function, with the increase of the data buf’s size, 64B loop will performance better than 16B loop.
> >
> >
> >
> > On the other hand, in the same environment I tested the 8B loop, which is worse than the 16-byte loop.
> >
> >
> >
> > The test result is shown in the fellow excel(crc test result.xlsx)
> > sheet1(64B loop) and sheet2(8B loop)
> >
> >
> >Maybe I phrased that wrong: if we add the 64-byte loop, there is no need for a 32-byte block, a 16 byte block and a 8 byte block, since they all use the same crc32x instruction. After the 64-byte loop, just loop in the 8-byte sequence until the remaining data is less than 8 bytes.
> >
> >
> >
> I think we should not use 8-byte loop after 64-byte loop. Although the number of code lines is reduced, but it will run more subs and b.cond instruction. I test it and shown the result in the fellow excel.
>

OK

> Why I used three temp variables to do the ldp below is because our processor have two load/store unit, if we use the registers which are independent, it can processed in parallel.
>

Yes, but you are adding three instructions to a tight loop, which will
be noticeable on in-order cores.

Just use something like

ldp x3, x4, [x0]
ldp x5, x6, [x0, #16]
ldp x7, x8, [x0, #32]
ldp x9, x10, [x0, #48]
add x0, x0, #64

Those are completely independent as well

> By the way,  In most cases, crc short XOR 0xffffffff before and after the calculation, if we add 'mvn w0, w0' at the beginning and before the return will bring some benefits. What do you think about it?

The C code will take care of that.

> >
> >
> > --
> >
> >
> >
> > > Signed-off-by: Rui Sun <sunrui26@huawei.com>
> >
> > > ---
> >
> > >  arch/arm64/lib/crc32.S | 54
> >
> > > ++++++++++++++++++++++++++++++++++++++++++++++----
> >
> > >  1 file changed, 50 insertions(+), 4 deletions(-)
> >
> > >
> >
> > > diff --git a/arch/arm64/lib/crc32.S b/arch/arm64/lib/crc32.S index
> >
> > > 5bc1e85..2b37009 100644
> >
> > > --- a/arch/arm64/lib/crc32.S
> >
> > > +++ b/arch/arm64/lib/crc32.S
> >
> > > @@ -15,15 +15,61 @@
> >
> > >         .cpu            generic+crc
> >
> > >
> >
> > >         .macro          __crc32, c
> >
> > > -0:     subs            x2, x2, #16
> >
> > > -       b.mi            8f
> >
> > > +
> >
> > > +64: cmp     x2, #64
> >
> > > +    b.lt    32f
> >
> > > +
> >
> > > +    adds    x11, x1, #16
> >
> > > +    adds    x12, x1, #32
> >
> > > +    adds    x13, x1, #48
> >
> > > +
> >
> > > +0 : subs    x2, x2, #64
> >
> > > +    b.mi    32f
> >
> > > +
> >
> > > +    ldp     x3, x4, [x1], #64
> >
> > > +    ldp     x5, x6, [x11], #64
> >
> > > +    ldp     x7, x8, [x12], #64
> >
> > > +    ldp     x9, x10,[x13], #64
> >
> > > +
> >
> >
> >
> > Can we do this instead, and get rid of the temp variables?
> >
> >
> >
> >     ldp     x3, x4, [x1], #64
> >
> >     ldp     x5, x6, [x1, #-48]
> >
> >     ldp     x7, x8, [x1, #-32]
> >
> >     ldp     x9, x10,[x1, #-16]
> >
> >
> >
> > > + CPU_BE( rev     x3, x3      )
> >
> > > + CPU_BE( rev     x4, x4      )
> >
> > > + CPU_BE( rev     x5, x5      )
> >
> > > + CPU_BE( rev     x6, x6      )
> >
> > > + CPU_BE( rev     x7, x7      )
> >
> > > + CPU_BE( rev     x8, x8      )
> >
> > > + CPU_BE( rev     x9, x9      )
> >
> > > + CPU_BE( rev     x10,x10     )
> >
> > > +
> >
> > > +    crc32\c\()x w0, w0, x3
> >
> > > +    crc32\c\()x w0, w0, x4
> >
> > > +    crc32\c\()x w0, w0, x5
> >
> > > +    crc32\c\()x w0, w0, x6
> >
> > > +    crc32\c\()x w0, w0, x7
> >
> > > +    crc32\c\()x w0, w0, x8
> >
> > > +    crc32\c\()x w0, w0, x9
> >
> > > +    crc32\c\()x w0, w0, x10
> >
> > > +
> >
> > > +    b.ne       0b
> >
> > > +    ret
> >
> > > +
> >
> > > +32: tbz     x2, #5, 16f
> >
> > > +    ldp     x3, x4, [x1], #16
> >
> > > +    ldp     x5, x6, [x1], #16
> >
> > > +CPU_BE( rev     x3, x3      )
> >
> > > +CPU_BE( rev     x4, x4      )
> >
> > > +CPU_BE( rev     x5, x5      )
> >
> > > +CPU_BE( rev     x6, x6      )
> >
> > > +    crc32\c\()x w0, w0, x3
> >
> > > +    crc32\c\()x w0, w0, x4
> >
> > > +    crc32\c\()x w0, w0, x5
> >
> > > +    crc32\c\()x w0, w0, x6
> >
> > > +
> >
> > > +16: tbz     x2, #4, 8f
> >
> > >         ldp             x3, x4, [x1], #16
> >
> > >  CPU_BE(        rev             x3, x3          )
> >
> > >  CPU_BE(        rev             x4, x4          )
> >
> > >         crc32\c\()x     w0, w0, x3
> >
> > >         crc32\c\()x     w0, w0, x4
> >
> > > -       b.ne            0b
> >
> > > -       ret
> >
> > >
> >
> > >  8:     tbz             x2, #3, 4f
> >
> > >         ldr             x3, [x1], #8
> >
> > > --
> >
> > > 1.8.3.1
> >
> > >
> >
> >

  reply	other threads:[~2018-11-24  9:56 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-24  6:41 Re: [PATCH] arm64: crc: accelerated-crc32-by-64bytes sunrui
2018-11-24  9:56 ` Ard Biesheuvel [this message]
2018-11-24 11:51   ` Ard Biesheuvel
     [not found] <0DF8AA8C9CD0FB4ABF4E83674055F7E2D7FFCC@dggemi508-mbx.china.huawei.com>
2018-11-22  8:55 ` Ard Biesheuvel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAKv+Gu-ffSgS-ZH58u5mNYnqR0++osS-PHn8oHW72o2kRfDYCw@mail.gmail.com \
    --to=ard.biesheuvel@linaro.org \
    --cc=catalin.marinas@arm.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sunrui26@huawei.com \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).