From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: sunrui26@huawei.com
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will.deacon@arm.com>,
linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: Re: [PATCH] arm64: crc: accelerated-crc32-by-64bytes
Date: Sat, 24 Nov 2018 10:56:41 +0100 [thread overview]
Message-ID: <CAKv+Gu-ffSgS-ZH58u5mNYnqR0++osS-PHn8oHW72o2kRfDYCw@mail.gmail.com> (raw)
In-Reply-To: <0DF8AA8C9CD0FB4ABF4E83674055F7E2D82974@dggemi508-mbx.china.huawei.com>
On Sat, 24 Nov 2018 at 07:42, sunrui <sunrui26@huawei.com> wrote:
>
>
> On Thu, 22 Nov 2018 at 02:50, sunrui <sunrui26@huawei.com> wrote:
> >
> >
> >
> > On Sun, 18 Nov 2018 at 23:30, Rui Sun <sunrui26@huawei.com> wrote:
> >
> > >
> >
> > > add 64 bytes loop to acceleration calculation
> >
> > >
> >
> >
> >
> > Can you share some performance numbers please?
> >
> >
> >
> > Also, we don't need 64 byte, 32 byte and 16 byte code paths: just make the 8 byte one a loop as well, and drop the 32 byte and 16 byte ones.
> >
> >
> >
> > --
> >
> >
> >
> > Consider of some processor has instruction N-way parallel function, with the increase of the data buf’s size, 64B loop will performance better than 16B loop.
> >
> >
> >
> > On the other hand, in the same environment I tested the 8B loop, which is worse than the 16-byte loop.
> >
> >
> >
> > The test result is shown in the fellow excel(crc test result.xlsx)
> > sheet1(64B loop) and sheet2(8B loop)
> >
> >
> >Maybe I phrased that wrong: if we add the 64-byte loop, there is no need for a 32-byte block, a 16 byte block and a 8 byte block, since they all use the same crc32x instruction. After the 64-byte loop, just loop in the 8-byte sequence until the remaining data is less than 8 bytes.
> >
> >
> >
> I think we should not use 8-byte loop after 64-byte loop. Although the number of code lines is reduced, but it will run more subs and b.cond instruction. I test it and shown the result in the fellow excel.
>
OK
> Why I used three temp variables to do the ldp below is because our processor have two load/store unit, if we use the registers which are independent, it can processed in parallel.
>
Yes, but you are adding three instructions to a tight loop, which will
be noticeable on in-order cores.
Just use something like
ldp x3, x4, [x0]
ldp x5, x6, [x0, #16]
ldp x7, x8, [x0, #32]
ldp x9, x10, [x0, #48]
add x0, x0, #64
Those are completely independent as well
> By the way, In most cases, crc short XOR 0xffffffff before and after the calculation, if we add 'mvn w0, w0' at the beginning and before the return will bring some benefits. What do you think about it?
The C code will take care of that.
> >
> >
> > --
> >
> >
> >
> > > Signed-off-by: Rui Sun <sunrui26@huawei.com>
> >
> > > ---
> >
> > > arch/arm64/lib/crc32.S | 54
> >
> > > ++++++++++++++++++++++++++++++++++++++++++++++----
> >
> > > 1 file changed, 50 insertions(+), 4 deletions(-)
> >
> > >
> >
> > > diff --git a/arch/arm64/lib/crc32.S b/arch/arm64/lib/crc32.S index
> >
> > > 5bc1e85..2b37009 100644
> >
> > > --- a/arch/arm64/lib/crc32.S
> >
> > > +++ b/arch/arm64/lib/crc32.S
> >
> > > @@ -15,15 +15,61 @@
> >
> > > .cpu generic+crc
> >
> > >
> >
> > > .macro __crc32, c
> >
> > > -0: subs x2, x2, #16
> >
> > > - b.mi 8f
> >
> > > +
> >
> > > +64: cmp x2, #64
> >
> > > + b.lt 32f
> >
> > > +
> >
> > > + adds x11, x1, #16
> >
> > > + adds x12, x1, #32
> >
> > > + adds x13, x1, #48
> >
> > > +
> >
> > > +0 : subs x2, x2, #64
> >
> > > + b.mi 32f
> >
> > > +
> >
> > > + ldp x3, x4, [x1], #64
> >
> > > + ldp x5, x6, [x11], #64
> >
> > > + ldp x7, x8, [x12], #64
> >
> > > + ldp x9, x10,[x13], #64
> >
> > > +
> >
> >
> >
> > Can we do this instead, and get rid of the temp variables?
> >
> >
> >
> > ldp x3, x4, [x1], #64
> >
> > ldp x5, x6, [x1, #-48]
> >
> > ldp x7, x8, [x1, #-32]
> >
> > ldp x9, x10,[x1, #-16]
> >
> >
> >
> > > + CPU_BE( rev x3, x3 )
> >
> > > + CPU_BE( rev x4, x4 )
> >
> > > + CPU_BE( rev x5, x5 )
> >
> > > + CPU_BE( rev x6, x6 )
> >
> > > + CPU_BE( rev x7, x7 )
> >
> > > + CPU_BE( rev x8, x8 )
> >
> > > + CPU_BE( rev x9, x9 )
> >
> > > + CPU_BE( rev x10,x10 )
> >
> > > +
> >
> > > + crc32\c\()x w0, w0, x3
> >
> > > + crc32\c\()x w0, w0, x4
> >
> > > + crc32\c\()x w0, w0, x5
> >
> > > + crc32\c\()x w0, w0, x6
> >
> > > + crc32\c\()x w0, w0, x7
> >
> > > + crc32\c\()x w0, w0, x8
> >
> > > + crc32\c\()x w0, w0, x9
> >
> > > + crc32\c\()x w0, w0, x10
> >
> > > +
> >
> > > + b.ne 0b
> >
> > > + ret
> >
> > > +
> >
> > > +32: tbz x2, #5, 16f
> >
> > > + ldp x3, x4, [x1], #16
> >
> > > + ldp x5, x6, [x1], #16
> >
> > > +CPU_BE( rev x3, x3 )
> >
> > > +CPU_BE( rev x4, x4 )
> >
> > > +CPU_BE( rev x5, x5 )
> >
> > > +CPU_BE( rev x6, x6 )
> >
> > > + crc32\c\()x w0, w0, x3
> >
> > > + crc32\c\()x w0, w0, x4
> >
> > > + crc32\c\()x w0, w0, x5
> >
> > > + crc32\c\()x w0, w0, x6
> >
> > > +
> >
> > > +16: tbz x2, #4, 8f
> >
> > > ldp x3, x4, [x1], #16
> >
> > > CPU_BE( rev x3, x3 )
> >
> > > CPU_BE( rev x4, x4 )
> >
> > > crc32\c\()x w0, w0, x3
> >
> > > crc32\c\()x w0, w0, x4
> >
> > > - b.ne 0b
> >
> > > - ret
> >
> > >
> >
> > > 8: tbz x2, #3, 4f
> >
> > > ldr x3, [x1], #8
> >
> > > --
> >
> > > 1.8.3.1
> >
> > >
> >
> >
next prev parent reply other threads:[~2018-11-24 9:56 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-11-24 6:41 Re: [PATCH] arm64: crc: accelerated-crc32-by-64bytes sunrui
2018-11-24 9:56 ` Ard Biesheuvel [this message]
2018-11-24 11:51 ` Ard Biesheuvel
[not found] <0DF8AA8C9CD0FB4ABF4E83674055F7E2D7FFCC@dggemi508-mbx.china.huawei.com>
2018-11-22 8:55 ` Ard Biesheuvel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAKv+Gu-ffSgS-ZH58u5mNYnqR0++osS-PHn8oHW72o2kRfDYCw@mail.gmail.com \
--to=ard.biesheuvel@linaro.org \
--cc=catalin.marinas@arm.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=sunrui26@huawei.com \
--cc=will.deacon@arm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).