From: Nathan Chancellor <nathan@kernel.org>
To: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	Nick Desaulniers <ndesaulniers@google.com>
Subject: Re: [PATCH 0/2] xor: enable auto-vectorization in Clang
Date: Thu, 27 Jan 2022 14:36:57 -0700
Message-ID: <YfMQeRVkNMHr81P9@dev-arch.archlinux-ax161>
In-Reply-To: <20220127081227.2430-1-ardb@kernel.org>

Hi Ard,

On Thu, Jan 27, 2022 at 09:12:25AM +0100, Ard Biesheuvel wrote:
> Update the xor_blocks() prototypes so that the compiler understands that
> the inputs always refer to distinct regions of memory. This is implied
> by the existing implementations, as they use different granularities for
> the load/xor/store loops.
> 
> With that, we can fix the ARM/Clang version, which refuses to SIMD
> vectorize otherwise, and throws a spurious warning related to the GCC
> version being incompatible.
> 
> Cc: Nick Desaulniers <ndesaulniers@google.com>
> Cc: Nathan Chancellor <nathan@kernel.org>
> 
> Ard Biesheuvel (2):
>   lib/xor: make xor prototypes more friendly to compiler vectorization
>   crypto: arm/xor - make vectorized C code Clang-friendly
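
For anyone following along, the idea is that restrict-qualified pointers
let the compiler assume the source and destination buffers never
overlap, so the load/xor/store loop is safe to vectorize. A simplified
userspace sketch (not the kernel's actual prototypes):

```c
#include <stddef.h>

/*
 * Simplified sketch of a two-operand XOR helper: restrict tells the
 * compiler that p1 and p2 refer to distinct regions of memory, which
 * removes the aliasing hazard that otherwise blocks auto-vectorization.
 */
static void xor_2(unsigned long bytes,
		  unsigned long * restrict p1,
		  const unsigned long * restrict p2)
{
	size_t n = bytes / sizeof(unsigned long);

	for (size_t i = 0; i < n; i++)
		p1[i] ^= p2[i];
}
```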

I tested multi_v7_defconfig + CONFIG_BTRFS=y (to get
CONFIG_XOR_BLOCKS=y) in QEMU 6.2.0 (10 boots) and the xor neon code gets
faster according to do_xor_speed():

mainline @ 626b2dda7651:

[    2.591449]    neon            :  1166 MB/sec
[    2.579454]    neon            :  1118 MB/sec
[    2.589061]    neon            :  1163 MB/sec
[    2.581827]    neon            :  1167 MB/sec
[    2.599079]    neon            :  1166 MB/sec
[    2.579252]    neon            :  1147 MB/sec
[    2.582637]    neon            :  1168 MB/sec
[    2.582872]    neon            :  1164 MB/sec
[    2.570671]    neon            :  1167 MB/sec
[    2.571830]    neon            :  1166 MB/sec

mainline @ 626b2dda7651 with series:

[    2.570227]    neon            :  1238 MB/sec
[    2.571642]    neon            :  1237 MB/sec
[    2.580370]    neon            :  1234 MB/sec
[    2.581966]    neon            :  1238 MB/sec
[    2.582313]    neon            :  1236 MB/sec
[    2.572291]    neon            :  1238 MB/sec
[    2.570625]    neon            :  1233 MB/sec
[    2.571897]    neon            :  1234 MB/sec
[    2.589616]    neon            :  1228 MB/sec
[    2.582449]    neon            :  1236 MB/sec

This series is currently broken for powerpc [1], as the functions in
arch/powerpc/lib/xor_vmx.c were not updated.

arch/powerpc/lib/xor_vmx.c:52:6: error: conflicting types for '__xor_altivec_2'
void __xor_altivec_2(unsigned long bytes, unsigned long *v1_in,
     ^
arch/powerpc/lib/xor_vmx.h:9:6: note: previous declaration is here
void __xor_altivec_2(unsigned long bytes, unsigned long * __restrict p1,
     ^
arch/powerpc/lib/xor_vmx.c:70:6: error: conflicting types for '__xor_altivec_3'
void __xor_altivec_3(unsigned long bytes, unsigned long *v1_in,
     ^
arch/powerpc/lib/xor_vmx.h:11:6: note: previous declaration is here
void __xor_altivec_3(unsigned long bytes, unsigned long * __restrict p1,
     ^
arch/powerpc/lib/xor_vmx.c:92:6: error: conflicting types for '__xor_altivec_4'
void __xor_altivec_4(unsigned long bytes, unsigned long *v1_in,
     ^
arch/powerpc/lib/xor_vmx.h:14:6: note: previous declaration is here
void __xor_altivec_4(unsigned long bytes, unsigned long * __restrict p1,
     ^
arch/powerpc/lib/xor_vmx.c:119:6: error: conflicting types for '__xor_altivec_5'
void __xor_altivec_5(unsigned long bytes, unsigned long *v1_in,
     ^
arch/powerpc/lib/xor_vmx.h:18:6: note: previous declaration is here
void __xor_altivec_5(unsigned long bytes, unsigned long * __restrict p1,
     ^
4 errors generated.
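
As I understand it, the conflict comes from the pointee qualifiers: the
header now declares `const unsigned long * __restrict` while the .c
file still defines plain `unsigned long *`, and const on the pointee
makes the parameter types incompatible (restrict on the pointer itself
would be dropped for type-compatibility purposes). A minimal userspace
sketch of the matching-definition fix, with hypothetical names:

```c
#include <stddef.h>

/* What a header like xor_vmx.h would declare (hypothetical name): */
void my_xor_2(unsigned long bytes, unsigned long * restrict p1,
	      const unsigned long * restrict p2);

/*
 * The definition, updated to spell the same parameter types as the
 * declaration; leaving out the const yields "conflicting types".
 */
void my_xor_2(unsigned long bytes, unsigned long * restrict p1,
	      const unsigned long * restrict p2)
{
	size_t n = bytes / sizeof(unsigned long);

	for (size_t i = 0; i < n; i++)
		p1[i] ^= p2[i];
}
```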

If I fix that up [2], it builds and resolves an instance of
-Wframe-larger-than= in the xor altivec code, as seen with
pmac32_defconfig.

Before this series:

arch/powerpc/lib/xor_vmx.c:119:6: error: stack frame size (1232) exceeds limit (1024) in '__xor_altivec_5' [-Werror,-Wframe-larger-than]
void __xor_altivec_5(unsigned long bytes, unsigned long *v1_in,
     ^
1 error generated.

After this series (with CONFIG_FRAME_WARN=100 and
CONFIG_PPC_DISABLE_WERROR=y):

arch/powerpc/lib/xor_vmx.c:52:6: warning: stack frame size (128) exceeds limit (100) in '__xor_altivec_2' [-Wframe-larger-than]
void __xor_altivec_2(unsigned long bytes,
     ^
arch/powerpc/lib/xor_vmx.c:71:6: warning: stack frame size (160) exceeds limit (100) in '__xor_altivec_3' [-Wframe-larger-than]
void __xor_altivec_3(unsigned long bytes,
     ^
arch/powerpc/lib/xor_vmx.c:95:6: warning: stack frame size (144) exceeds limit (100) in '__xor_altivec_4' [-Wframe-larger-than]
void __xor_altivec_4(unsigned long bytes,
     ^
arch/powerpc/lib/xor_vmx.c:124:6: warning: stack frame size (160) exceeds limit (100) in '__xor_altivec_5' [-Wframe-larger-than]
void __xor_altivec_5(unsigned long bytes,
     ^
4 warnings generated.

There is a performance gain similar to ARM's, according to do_xor_speed():

Before:

   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   219 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec

After:

   altivec         :   278 MB/sec
   altivec         :   276 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec

I also build tested arm64 and x86_64 and saw no errors. I runtime
tested arm64 for improvements and did not see any, which is good: I
take that to mean it was already working fine there and this series
introduces no regression.

Once the build error is fixed, consider this series:

Tested-by: Nathan Chancellor <nathan@kernel.org>

[1]: https://lore.kernel.org/r/202112310646.kuh2pXiG-lkp@intel.com/
[2]: https://github.com/ClangBuiltLinux/linux/issues/563#issuecomment-1005175153

Cheers,
Nathan
