From: Gary Guo <gary@garyguo.net> To: David Laight <David.Laight@ACULAB.COM> Cc: 'Palmer Dabbelt' <palmer@dabbelt.com>, Paul Walmsley <paul.walmsley@sifive.com>, "aou@eecs.berkeley.edu" <aou@eecs.berkeley.edu>, "nickhu@andestech.com" <nickhu@andestech.com>, "nylon7@andestech.com" <nylon7@andestech.com>, "linux-riscv@lists.infradead.org" <linux-riscv@lists.infradead.org>, "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org> Subject: Re: [PATCH] riscv: fix memmove and optimise memcpy when misalign Date: Tue, 25 May 2021 15:34:32 +0100 [thread overview] Message-ID: <20210525153431.0000508d@garyguo.net> (raw) In-Reply-To: <17637b10e71b41b89126cbb1b2fa61cf@AcuMS.aculab.com> On Sun, 23 May 2021 17:12:23 +0000 David Laight <David.Laight@ACULAB.COM> wrote: > From: Palmer Dabbelt > > Sent: 23 May 2021 02:47 > ... > > IMO the right way to go here is to just move to C-based string > > routines, at least until we get to the point where we're seriously > > optimizing for specific processors. We went with the C-based > > string rountines in glibc as part of the upstreaming process and > > found only some small performance differences when compared to the > > hand-written assembly, and they're way easier to maintain. I prefer C versions as well, and actually before commit 04091d6 we are indeed using the generic C version. The issue is that 04091d6 introduces an assembly version that's very broken. It does not offer and performance improvement to the C version, and breaks all processors without hardware misalignment support (yes, firmware is expected to trap and handle these, but they are painfully slow). I noticed the issue because I ran Linux on my own firmware and found that kernel couldn't boot. I didn't implement misalignment emulation at that time (and just send the trap to the supervisor). Because 04091d6 is accepted, my assumption is that we need an assembly version. So I spent some time writing, testing and optimising the assembly. > > > > IIRC Linux only has trivial C string routines in lib, I think the > > best way to go about that would be to higher performance versions > > in there. That will allow other ports to use them. > > I certainly wonder how much benefit these massively unrolled > loops have on modern superscaler processors - especially those > with any form of 'out of order' execution. > > It is often easy to write assembler where all the loop > control instructions happen in parallel with the memory > accesses - which cannot be avoided. > Loop unrolling is so 1970s. > > Sometimes you need to unroll once. > And maybe interleave the loads and stores. > But after that you can just be trashing the i-cache. I didn't introduce the loop unrolling though. The loop unrolled assembly is there before this patch, and I didn't even change the unroll factor. I only added a path to handle misaligned case. There are a lot of diffs because I did made some changes to the register allocation so that the code is more optimal. I also made a few cleanups and added a few comments. It might be easier to review if you apply the patch locally and just look at the file. - Gary
WARNING: multiple messages have this Message-ID (diff)
From: Gary Guo <gary@garyguo.net> To: David Laight <David.Laight@ACULAB.COM> Cc: 'Palmer Dabbelt' <palmer@dabbelt.com>, Paul Walmsley <paul.walmsley@sifive.com>, "aou@eecs.berkeley.edu" <aou@eecs.berkeley.edu>, "nickhu@andestech.com" <nickhu@andestech.com>, "nylon7@andestech.com" <nylon7@andestech.com>, "linux-riscv@lists.infradead.org" <linux-riscv@lists.infradead.org>, "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org> Subject: Re: [PATCH] riscv: fix memmove and optimise memcpy when misalign Date: Tue, 25 May 2021 15:34:32 +0100 [thread overview] Message-ID: <20210525153431.0000508d@garyguo.net> (raw) In-Reply-To: <17637b10e71b41b89126cbb1b2fa61cf@AcuMS.aculab.com> On Sun, 23 May 2021 17:12:23 +0000 David Laight <David.Laight@ACULAB.COM> wrote: > From: Palmer Dabbelt > > Sent: 23 May 2021 02:47 > ... > > IMO the right way to go here is to just move to C-based string > > routines, at least until we get to the point where we're seriously > > optimizing for specific processors. We went with the C-based > > string rountines in glibc as part of the upstreaming process and > > found only some small performance differences when compared to the > > hand-written assembly, and they're way easier to maintain. I prefer C versions as well, and actually before commit 04091d6 we are indeed using the generic C version. The issue is that 04091d6 introduces an assembly version that's very broken. It does not offer and performance improvement to the C version, and breaks all processors without hardware misalignment support (yes, firmware is expected to trap and handle these, but they are painfully slow). I noticed the issue because I ran Linux on my own firmware and found that kernel couldn't boot. I didn't implement misalignment emulation at that time (and just send the trap to the supervisor). Because 04091d6 is accepted, my assumption is that we need an assembly version. So I spent some time writing, testing and optimising the assembly. > > > > IIRC Linux only has trivial C string routines in lib, I think the > > best way to go about that would be to higher performance versions > > in there. That will allow other ports to use them. > > I certainly wonder how much benefit these massively unrolled > loops have on modern superscaler processors - especially those > with any form of 'out of order' execution. > > It is often easy to write assembler where all the loop > control instructions happen in parallel with the memory > accesses - which cannot be avoided. > Loop unrolling is so 1970s. > > Sometimes you need to unroll once. > And maybe interleave the loads and stores. > But after that you can just be trashing the i-cache. I didn't introduce the loop unrolling though. The loop unrolled assembly is there before this patch, and I didn't even change the unroll factor. I only added a path to handle misaligned case. There are a lot of diffs because I did made some changes to the register allocation so that the code is more optimal. I also made a few cleanups and added a few comments. It might be easier to review if you apply the patch locally and just look at the file. - Gary _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv
next prev parent reply other threads:[~2021-05-25 14:34 UTC|newest] Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-02-16 22:55 [PATCH] riscv: fix memmove and optimise memcpy when misalign Gary Guo 2021-02-16 22:55 ` Gary Guo 2021-05-13 8:13 ` Bin Meng 2021-05-13 8:13 ` Bin Meng 2021-05-22 22:22 ` Gary Guo 2021-05-22 22:22 ` Gary Guo 2021-05-23 1:47 ` Palmer Dabbelt 2021-05-23 1:47 ` Palmer Dabbelt 2021-05-23 17:12 ` David Laight 2021-05-23 17:12 ` David Laight 2021-05-25 14:34 ` Gary Guo [this message] 2021-05-25 14:34 ` Gary Guo 2021-06-15 13:40 ` Bin Meng 2021-06-15 13:40 ` Bin Meng 2021-06-15 14:08 ` David Laight 2021-06-15 14:08 ` David Laight
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20210525153431.0000508d@garyguo.net \ --to=gary@garyguo.net \ --cc=David.Laight@ACULAB.COM \ --cc=aou@eecs.berkeley.edu \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-riscv@lists.infradead.org \ --cc=nickhu@andestech.com \ --cc=nylon7@andestech.com \ --cc=palmer@dabbelt.com \ --cc=paul.walmsley@sifive.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.