From: Arnd Bergmann <arnd@kernel.org> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>, Eric Biggers <ebiggers@kernel.org>, linux-arch <linux-arch@vger.kernel.org>, Vineet Gupta <vgupta@synopsys.com>, Russell King <linux@armlinux.org.uk>, Herbert Xu <herbert@gondor.apana.org.au>, "David S. Miller" <davem@davemloft.net>, Thomas Bogendoerfer <tsbogend@alpha.franken.de>, Linux ARM <linux-arm-kernel@lists.infradead.org>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, "open list:HARDWARE RANDOM NUMBER GENERATOR CORE" <linux-crypto@vger.kernel.org>, "open list:BROADCOM NVRAM DRIVER" <linux-mips@vger.kernel.org>, Nobuhiro Iwamatsu <iwamatsu@debian.org> Subject: Re: [PATCH v2 07/13] asm-generic: unaligned always use struct helpers Date: Tue, 18 May 2021 22:51:23 +0200 [thread overview] Message-ID: <CAK8P3a0QMjP-i7aw_CBRHPu7ffzX0p_vYF_SRtpd_iB8HW5TqQ@mail.gmail.com> (raw) In-Reply-To: <CAHk-=wjuoGyxDhAF8SsrTkN0-YfCx7E6jUN3ikC_tn2AKWTTsA@mail.gmail.com> On Tue, May 18, 2021 at 6:12 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, May 18, 2021 at 5:42 AM Arnd Bergmann <arnd@kernel.org> wrote: > > Of the other cases, that xor-neon.c case actually makes sense. For > that file, it literally exists _only_ to get a vectorized version of > the trivial xor_8regs loop. It's one of the (very very few) cases of > vectorization we actually want. And in that case, we might even want > to make things easier - and more explicit - for the compiler by making > the xor_8regs loops use "restrict" pointers. > > That neon case actually wants and needs that tree-vectorization to > DTRT. But maybe it doesn't need the actual _loop_ vectorization? The > xor_8regs code is literally using hand-unrolled loops already, exactly > to make it as simple as possible for the compiler (but the lack of > "restrict" pointers means that it's not all that simple after all, and > I assume the compiler generates conditionals for the NEON case? Right, I think there is an ongoing debate over how to best handle this one in clang, since that does not do any vectorization for this file unless the pointers are marked "restrict". As far as I remember, there are some callers that want to do the xor in place though, which means restrict is wrong. > lz4 is questionable - yes, upstream lh4 seems to use -O3 (good), but > it also very much uses unaligned accesses, which is where the gcc bug > hits. I doubt that it really needs or wants the loop vectorization. I ran some limited speed tests with the lz4 sources that come with Ubuntu, using gcc-10.3 on an AMD Zen1 Threadripper with 10GB of /dev/urandom input. This package patches the sources to use -O2 and no vectorization, which turns out to be the fastest combination for my stupid test as well. The results are barely above noise, but it appears that -O2 -ftree-loop-vectorize makes it slightly slower than just -O2, while -O3 is even slower than that regardless of -fno-tree-loop-vectorize/-ftree-loop-vectorize. I see that Nobuhiro Iwamatsu (Cc'd) changed the Debian lz4 package to use -O2, but I don't see an explanation for it. I also see that the lz4 sources force -O2 on ppc64 because -O3 causes a 30% slowdown from vectorization. The kernel version is missing the bit that does this. > zstd looks very similar to lz4. > End result: at a minimum, I'd suggest using > "-fno-tree-loop-vectorize", although somebody should check that NEON > case. > And I still think that using O3 for anything halfway complicated > should be considered odd and need some strong numbers to enable. Agreed. I think there is a fairly strong case for just using -O2 on lz4 and backport that to stable. Searching for lz4 bugs with -O3 also finds several reports including one that I sent myself: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69702 I see that user space zstd is built with -O3 in Debian, but it the changelog also lists "Improved : better speed on clang and gcc -O2, thanks to Eric Biggers", so maybe Eric has some useful ideas on whether we should just use -O2 for the in-kernel version. Arnd
WARNING: multiple messages have this Message-ID (diff)
From: Arnd Bergmann <arnd@kernel.org> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>, Eric Biggers <ebiggers@kernel.org>, linux-arch <linux-arch@vger.kernel.org>, Vineet Gupta <vgupta@synopsys.com>, Russell King <linux@armlinux.org.uk>, Herbert Xu <herbert@gondor.apana.org.au>, "David S. Miller" <davem@davemloft.net>, Thomas Bogendoerfer <tsbogend@alpha.franken.de>, Linux ARM <linux-arm-kernel@lists.infradead.org>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, "open list:HARDWARE RANDOM NUMBER GENERATOR CORE" <linux-crypto@vger.kernel.org>, "open list:BROADCOM NVRAM DRIVER" <linux-mips@vger.kernel.org>, Nobuhiro Iwamatsu <iwamatsu@debian.org> Subject: Re: [PATCH v2 07/13] asm-generic: unaligned always use struct helpers Date: Tue, 18 May 2021 22:51:23 +0200 [thread overview] Message-ID: <CAK8P3a0QMjP-i7aw_CBRHPu7ffzX0p_vYF_SRtpd_iB8HW5TqQ@mail.gmail.com> (raw) In-Reply-To: <CAHk-=wjuoGyxDhAF8SsrTkN0-YfCx7E6jUN3ikC_tn2AKWTTsA@mail.gmail.com> On Tue, May 18, 2021 at 6:12 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, May 18, 2021 at 5:42 AM Arnd Bergmann <arnd@kernel.org> wrote: > > Of the other cases, that xor-neon.c case actually makes sense. For > that file, it literally exists _only_ to get a vectorized version of > the trivial xor_8regs loop. It's one of the (very very few) cases of > vectorization we actually want. And in that case, we might even want > to make things easier - and more explicit - for the compiler by making > the xor_8regs loops use "restrict" pointers. > > That neon case actually wants and needs that tree-vectorization to > DTRT. But maybe it doesn't need the actual _loop_ vectorization? The > xor_8regs code is literally using hand-unrolled loops already, exactly > to make it as simple as possible for the compiler (but the lack of > "restrict" pointers means that it's not all that simple after all, and > I assume the compiler generates conditionals for the NEON case? Right, I think there is an ongoing debate over how to best handle this one in clang, since that does not do any vectorization for this file unless the pointers are marked "restrict". As far as I remember, there are some callers that want to do the xor in place though, which means restrict is wrong. > lz4 is questionable - yes, upstream lh4 seems to use -O3 (good), but > it also very much uses unaligned accesses, which is where the gcc bug > hits. I doubt that it really needs or wants the loop vectorization. I ran some limited speed tests with the lz4 sources that come with Ubuntu, using gcc-10.3 on an AMD Zen1 Threadripper with 10GB of /dev/urandom input. This package patches the sources to use -O2 and no vectorization, which turns out to be the fastest combination for my stupid test as well. The results are barely above noise, but it appears that -O2 -ftree-loop-vectorize makes it slightly slower than just -O2, while -O3 is even slower than that regardless of -fno-tree-loop-vectorize/-ftree-loop-vectorize. I see that Nobuhiro Iwamatsu (Cc'd) changed the Debian lz4 package to use -O2, but I don't see an explanation for it. I also see that the lz4 sources force -O2 on ppc64 because -O3 causes a 30% slowdown from vectorization. The kernel version is missing the bit that does this. > zstd looks very similar to lz4. > End result: at a minimum, I'd suggest using > "-fno-tree-loop-vectorize", although somebody should check that NEON > case. > And I still think that using O3 for anything halfway complicated > should be considered odd and need some strong numbers to enable. Agreed. I think there is a fairly strong case for just using -O2 on lz4 and backport that to stable. Searching for lz4 bugs with -O3 also finds several reports including one that I sent myself: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69702 I see that user space zstd is built with -O3 in Debian, but it the changelog also lists "Improved : better speed on clang and gcc -O2, thanks to Eric Biggers", so maybe Eric has some useful ideas on whether we should just use -O2 for the in-kernel version. Arnd _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next prev parent reply other threads:[~2021-05-18 20:52 UTC|newest] Thread overview: 98+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-05-14 10:00 [PATCH v2 00/13] Unify asm/unaligned.h around struct helper Arnd Bergmann 2021-05-14 10:00 ` [OpenRISC] " Arnd Bergmann 2021-05-14 10:00 ` Arnd Bergmann 2021-05-14 10:00 ` Arnd Bergmann 2021-05-14 10:00 ` [PATCH v2 01/13] asm-generic: use asm-generic/unaligned.h for most architectures Arnd Bergmann 2021-05-14 10:00 ` Arnd Bergmann 2021-05-14 10:00 ` Arnd Bergmann 2021-05-14 10:00 ` [PATCH v2 02/13] openrisc: always use unaligned-struct header Arnd Bergmann 2021-05-14 10:00 ` [OpenRISC] " Arnd Bergmann 2021-05-14 10:00 ` [PATCH v2 03/13] sh: remove unaligned access for sh4a Arnd Bergmann 2021-05-14 10:34 ` John Paul Adrian Glaubitz 2021-05-14 12:22 ` Arnd Bergmann 2021-05-15 15:36 ` John Paul Adrian Glaubitz 2021-05-15 20:10 ` Arnd Bergmann 2021-05-14 10:00 ` [PATCH v2 04/13] m68k: select CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS Arnd Bergmann 2021-05-14 10:00 ` [PATCH v2 05/13] powerpc: use linux/unaligned/le_struct.h on LE power7 Arnd Bergmann 2021-05-14 10:00 ` Arnd Bergmann 2021-05-14 11:48 ` Segher Boessenkool 2021-05-14 11:48 ` Segher Boessenkool 2021-05-14 13:02 ` Arnd Bergmann 2021-05-14 10:00 ` [PATCH v2 06/13] asm-generic: unaligned: remove byteshift helpers Arnd Bergmann 2021-05-14 10:00 ` Arnd Bergmann 2021-05-14 10:00 ` [PATCH v2 07/13] asm-generic: unaligned always use struct helpers Arnd Bergmann 2021-05-14 10:00 ` Arnd Bergmann 2021-05-17 21:53 ` Eric Biggers 2021-05-17 21:53 ` Eric Biggers 2021-05-18 7:25 ` Arnd Bergmann 2021-05-18 7:25 ` Arnd Bergmann 2021-05-18 14:56 ` Linus Torvalds 2021-05-18 14:56 ` Linus Torvalds 2021-05-18 15:41 ` Arnd Bergmann 2021-05-18 15:41 ` Arnd Bergmann 2021-05-18 16:12 ` Linus Torvalds 2021-05-18 16:12 ` Linus Torvalds 2021-05-18 18:09 ` Jason A. Donenfeld 2021-05-18 18:09 ` Jason A. Donenfeld 2021-05-18 20:51 ` Arnd Bergmann [this message] 2021-05-18 20:51 ` Arnd Bergmann 2021-05-18 21:31 ` Eric Biggers 2021-05-18 21:31 ` Eric Biggers 2021-05-18 21:14 ` David Laight 2021-05-18 21:14 ` David Laight 2021-05-14 10:00 ` [PATCH v2 08/13] partitions: msdos: fix one-byte get_unaligned() Arnd Bergmann 2021-05-17 10:28 ` Christoph Hellwig 2021-05-17 10:44 ` Arnd Bergmann 2021-05-14 10:00 ` [PATCH v2 09/13] apparmor: use get_unaligned() only for multi-byte words Arnd Bergmann 2021-05-14 10:00 ` [PATCH v2 10/13] mwifiex: re-fix for unaligned accesses Arnd Bergmann 2021-05-15 6:22 ` Kalle Valo 2021-05-15 9:01 ` Arnd Bergmann 2021-05-15 18:23 ` Kalle Valo 2021-05-14 10:00 ` [PATCH v2 11/13] netpoll: avoid put_unaligned() on single character Arnd Bergmann 2021-05-14 10:01 ` [PATCH v2 12/13] asm-generic: uaccess: 1-byte access is always aligned Arnd Bergmann 2021-05-15 18:41 ` Randy Dunlap 2021-05-15 20:16 ` Arnd Bergmann 2021-05-14 10:01 ` [PATCH v2 13/13] asm-generic: simplify asm/unaligned.h Arnd Bergmann 2021-05-14 10:35 ` David Laight 2021-05-14 17:32 ` [PATCH v2 00/13] Unify asm/unaligned.h around struct helper Linus Torvalds 2021-05-14 17:32 ` [OpenRISC] " Linus Torvalds 2021-05-14 17:32 ` Linus Torvalds 2021-05-14 17:32 ` Linus Torvalds 2021-05-14 18:51 ` Vineet Gupta 2021-05-14 18:51 ` [OpenRISC] " Vineet Gupta 2021-05-14 18:51 ` Vineet Gupta 2021-05-14 18:51 ` Vineet Gupta 2021-05-14 19:22 ` Linus Torvalds 2021-05-14 19:22 ` [OpenRISC] " Linus Torvalds 2021-05-14 19:22 ` Linus Torvalds 2021-05-14 19:22 ` Linus Torvalds 2021-05-14 19:45 ` Vineet Gupta 2021-05-14 19:45 ` [OpenRISC] " Vineet Gupta 2021-05-14 19:45 ` Vineet Gupta 2021-05-14 19:45 ` Vineet Gupta 2021-05-14 20:19 ` Linus Torvalds 2021-05-14 20:19 ` [OpenRISC] " Linus Torvalds 2021-05-14 20:19 ` Linus Torvalds 2021-05-14 20:19 ` Linus Torvalds 2021-05-14 19:31 ` Arnd Bergmann 2021-05-14 19:31 ` [OpenRISC] " Arnd Bergmann 2021-05-14 19:31 ` Arnd Bergmann 2021-05-14 19:31 ` Arnd Bergmann 2021-12-16 17:29 ` Ard Biesheuvel 2021-12-16 17:29 ` [OpenRISC] " Ard Biesheuvel 2021-12-16 17:29 ` Ard Biesheuvel 2021-12-16 17:42 ` Linus Torvalds 2021-12-16 17:42 ` [OpenRISC] " Linus Torvalds 2021-12-16 17:42 ` Linus Torvalds 2021-12-16 17:49 ` David Laight 2021-12-16 17:49 ` [OpenRISC] " David Laight 2021-12-16 17:49 ` David Laight 2021-12-16 18:56 ` Segher Boessenkool 2021-12-16 18:56 ` [OpenRISC] " Segher Boessenkool 2021-12-16 18:56 ` Segher Boessenkool 2021-12-17 12:34 ` David Laight 2021-12-17 12:34 ` [OpenRISC] " David Laight 2021-12-17 12:34 ` David Laight 2021-12-17 13:35 ` Segher Boessenkool 2021-12-17 13:35 ` [OpenRISC] " Segher Boessenkool 2021-12-17 13:35 ` Segher Boessenkool
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=CAK8P3a0QMjP-i7aw_CBRHPu7ffzX0p_vYF_SRtpd_iB8HW5TqQ@mail.gmail.com \ --to=arnd@kernel.org \ --cc=Jason@zx2c4.com \ --cc=davem@davemloft.net \ --cc=ebiggers@kernel.org \ --cc=herbert@gondor.apana.org.au \ --cc=iwamatsu@debian.org \ --cc=linux-arch@vger.kernel.org \ --cc=linux-arm-kernel@lists.infradead.org \ --cc=linux-crypto@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mips@vger.kernel.org \ --cc=linux@armlinux.org.uk \ --cc=torvalds@linux-foundation.org \ --cc=tsbogend@alpha.franken.de \ --cc=vgupta@synopsys.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.