[PATCH 0/2] arm: lib: xor-neon: Remove warn & disble neon vect

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] arm: lib: xor-neon: Remove warn & disble neon vect
@ 2020-11-06  5:14 ` Adrian Ratiu
  0 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-06  5:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Nathan Chancellor, Nick Desaulniers, Arnd Bergmann,
	clang-built-linux, Russell King, linux-kernel, kernel

Dear all,

This is my attempt to close the loop on a relatively old discussion
[1] caused by a compiler bug [2]. In a nutshell, the Clang build issues
a bogus warning about GCC while it silently botches the neon auto-loop
vectorization. :)

Many thanks to all who have investigated this issue before me. Arnd
posted a workaround for xor.h [3], but I very much like his first
suggestion of disabling the broken feature until the compiler is fixed.

Tested on latest linux next-20201105 using bcm2835 & versatile configs
and Clang 10.0.1

P.S: While testing aarch64/imx8m I also noticed vectorization is broke
there as well, but that deserves its own patch because it's a separate
xor-neon implementation (if this approach is deemed sensible).

[1] https://patchwork.kernel.org/project/linux-arm-kernel/patch/20190528235742.105510-1-natechancellor@gmail.com/
[2] https://bugs.llvm.org/show_bug.cgi?id=40976
[3] https://bugs.llvm.org/show_bug.cgi?id=40976#c6

Kind regards,
Adrian

Adrian Ratiu (1):
  arm: lib: xor-neon: disable clang vectorization

Nathan Chancellor (1):
  arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

 arch/arm/include/asm/xor.h |  3 ++-
 arch/arm/lib/Makefile      |  3 +++
 arch/arm/lib/xor-neon.c    | 13 +++++--------
 3 files changed, 10 insertions(+), 9 deletions(-)

-- 
2.29.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 0/2] arm: lib: xor-neon: Remove warn & disble neon vect
@ 2020-11-06  5:14 ` Adrian Ratiu
  0 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-06  5:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Arnd Bergmann, Nick Desaulniers, Russell King, linux-kernel,
	clang-built-linux, Nathan Chancellor, kernel

Dear all,

This is my attempt to close the loop on a relatively old discussion
[1] caused by a compiler bug [2]. In a nutshell, the Clang build issues
a bogus warning about GCC while it silently botches the neon auto-loop
vectorization. :)

Many thanks to all who have investigated this issue before me. Arnd
posted a workaround for xor.h [3], but I very much like his first
suggestion of disabling the broken feature until the compiler is fixed.

Tested on latest linux next-20201105 using bcm2835 & versatile configs
and Clang 10.0.1

P.S: While testing aarch64/imx8m I also noticed vectorization is broke
there as well, but that deserves its own patch because it's a separate
xor-neon implementation (if this approach is deemed sensible).

[1] https://patchwork.kernel.org/project/linux-arm-kernel/patch/20190528235742.105510-1-natechancellor@gmail.com/
[2] https://bugs.llvm.org/show_bug.cgi?id=40976
[3] https://bugs.llvm.org/show_bug.cgi?id=40976#c6

Kind regards,
Adrian

Adrian Ratiu (1):
  arm: lib: xor-neon: disable clang vectorization

Nathan Chancellor (1):
  arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

 arch/arm/include/asm/xor.h |  3 ++-
 arch/arm/lib/Makefile      |  3 +++
 arch/arm/lib/xor-neon.c    | 13 +++++--------
 3 files changed, 10 insertions(+), 9 deletions(-)

-- 
2.29.0

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning
  2020-11-06  5:14 ` Adrian Ratiu
@ 2020-11-06  5:14   ` Adrian Ratiu
  -1 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-06  5:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Nathan Chancellor, Nick Desaulniers, Arnd Bergmann,
	clang-built-linux, Russell King, linux-kernel, kernel

From: Nathan Chancellor <natechancellor@gmail.com>

Drop warning because kernel now requires GCC >= v4.9 after
commit 6ec4476ac825 ("Raise gcc version requirement to 4.9").

Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
---
 arch/arm/lib/xor-neon.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
index b99dd8e1c93f..e1e76186ec23 100644
--- a/arch/arm/lib/xor-neon.c
+++ b/arch/arm/lib/xor-neon.c
@@ -19,15 +19,8 @@ MODULE_LICENSE("GPL");
  * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
  * NEON instructions.
  */
-#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
+#ifdef CONFIG_CC_IS_GCC
 #pragma GCC optimize "tree-vectorize"
-#else
-/*
- * While older versions of GCC do not generate incorrect code, they fail to
- * recognize the parallel nature of these functions, and emit plain ARM code,
- * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
- */
-#warning This code requires at least version 4.6 of GCC
 #endif
 
 #pragma GCC diagnostic ignored "-Wunused-variable"
-- 
2.29.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning
@ 2020-11-06  5:14   ` Adrian Ratiu
  0 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-06  5:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Arnd Bergmann, Nick Desaulniers, Russell King, linux-kernel,
	clang-built-linux, Nathan Chancellor, kernel

From: Nathan Chancellor <natechancellor@gmail.com>

Drop warning because kernel now requires GCC >= v4.9 after
commit 6ec4476ac825 ("Raise gcc version requirement to 4.9").

Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
---
 arch/arm/lib/xor-neon.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
index b99dd8e1c93f..e1e76186ec23 100644
--- a/arch/arm/lib/xor-neon.c
+++ b/arch/arm/lib/xor-neon.c
@@ -19,15 +19,8 @@ MODULE_LICENSE("GPL");
  * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
  * NEON instructions.
  */
-#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
+#ifdef CONFIG_CC_IS_GCC
 #pragma GCC optimize "tree-vectorize"
-#else
-/*
- * While older versions of GCC do not generate incorrect code, they fail to
- * recognize the parallel nature of these functions, and emit plain ARM code,
- * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
- */
-#warning This code requires at least version 4.6 of GCC
 #endif
 
 #pragma GCC diagnostic ignored "-Wunused-variable"
-- 
2.29.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-06  5:14 ` Adrian Ratiu
@ 2020-11-06  5:14   ` Adrian Ratiu
  -1 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-06  5:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Nathan Chancellor, Nick Desaulniers, Arnd Bergmann,
	clang-built-linux, Russell King, linux-kernel, kernel

Due to a Clang bug [1] neon autoloop vectorization does not happen or
happens badly with no gains and considering previous GCC experiences
which generated unoptimized code which was worse than the default asm
implementation, it is safer to default clang builds to the known good
generic implementation.

The kernel currently supports a minimum Clang version of v10.0.1, see
commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").

When the bug gets eventually fixed, this commit could be reverted or,
if the minimum clang version bump takes a long time, a warning could
be added for users to upgrade their compilers like was done for GCC.

[1] https://bugs.llvm.org/show_bug.cgi?id=40976

Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
---
 arch/arm/include/asm/xor.h | 3 ++-
 arch/arm/lib/Makefile      | 3 +++
 arch/arm/lib/xor-neon.c    | 4 ++++
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
index aefddec79286..49937dafaa71 100644
--- a/arch/arm/include/asm/xor.h
+++ b/arch/arm/include/asm/xor.h
@@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
 		NEON_TEMPLATES;			\
 	} while (0)
 
-#ifdef CONFIG_KERNEL_MODE_NEON
+/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
+#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
 
 extern struct xor_block_template const xor_block_neon_inner;
 
diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
index 6d2ba454f25b..53f9e7dd9714 100644
--- a/arch/arm/lib/Makefile
+++ b/arch/arm/lib/Makefile
@@ -43,8 +43,11 @@ endif
 $(obj)/csumpartialcopy.o:	$(obj)/csumpartialcopygeneric.S
 $(obj)/csumpartialcopyuser.o:	$(obj)/csumpartialcopygeneric.S
 
+# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
+ifndef CONFIG_CC_IS_CLANG
 ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
   NEON_FLAGS			:= -march=armv7-a -mfloat-abi=softfp -mfpu=neon
   CFLAGS_xor-neon.o		+= $(NEON_FLAGS)
   obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
 endif
+endif
diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
index e1e76186ec23..84c91c48dfa2 100644
--- a/arch/arm/lib/xor-neon.c
+++ b/arch/arm/lib/xor-neon.c
@@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
  * Pull in the reference implementations while instructing GCC (through
  * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
  * NEON instructions.
+
+ * On Clang the loop vectorizer is enabled by default, but due to a bug
+ * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
+ * so xor-neon is disabled in favor of the default reg implementations.
  */
 #ifdef CONFIG_CC_IS_GCC
 #pragma GCC optimize "tree-vectorize"
-- 
2.29.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-06  5:14   ` Adrian Ratiu
  0 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-06  5:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Arnd Bergmann, Nick Desaulniers, Russell King, linux-kernel,
	clang-built-linux, Nathan Chancellor, kernel

Due to a Clang bug [1] neon autoloop vectorization does not happen or
happens badly with no gains and considering previous GCC experiences
which generated unoptimized code which was worse than the default asm
implementation, it is safer to default clang builds to the known good
generic implementation.

The kernel currently supports a minimum Clang version of v10.0.1, see
commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").

When the bug gets eventually fixed, this commit could be reverted or,
if the minimum clang version bump takes a long time, a warning could
be added for users to upgrade their compilers like was done for GCC.

[1] https://bugs.llvm.org/show_bug.cgi?id=40976

Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
---
 arch/arm/include/asm/xor.h | 3 ++-
 arch/arm/lib/Makefile      | 3 +++
 arch/arm/lib/xor-neon.c    | 4 ++++
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
index aefddec79286..49937dafaa71 100644
--- a/arch/arm/include/asm/xor.h
+++ b/arch/arm/include/asm/xor.h
@@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
 		NEON_TEMPLATES;			\
 	} while (0)
 
-#ifdef CONFIG_KERNEL_MODE_NEON
+/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
+#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
 
 extern struct xor_block_template const xor_block_neon_inner;
 
diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
index 6d2ba454f25b..53f9e7dd9714 100644
--- a/arch/arm/lib/Makefile
+++ b/arch/arm/lib/Makefile
@@ -43,8 +43,11 @@ endif
 $(obj)/csumpartialcopy.o:	$(obj)/csumpartialcopygeneric.S
 $(obj)/csumpartialcopyuser.o:	$(obj)/csumpartialcopygeneric.S
 
+# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
+ifndef CONFIG_CC_IS_CLANG
 ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
   NEON_FLAGS			:= -march=armv7-a -mfloat-abi=softfp -mfpu=neon
   CFLAGS_xor-neon.o		+= $(NEON_FLAGS)
   obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
 endif
+endif
diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
index e1e76186ec23..84c91c48dfa2 100644
--- a/arch/arm/lib/xor-neon.c
+++ b/arch/arm/lib/xor-neon.c
@@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
  * Pull in the reference implementations while instructing GCC (through
  * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
  * NEON instructions.
+
+ * On Clang the loop vectorizer is enabled by default, but due to a bug
+ * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
+ * so xor-neon is disabled in favor of the default reg implementations.
  */
 #ifdef CONFIG_CC_IS_GCC
 #pragma GCC optimize "tree-vectorize"
-- 
2.29.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-06  5:14   ` Adrian Ratiu
@ 2020-11-06 10:14     ` Nathan Chancellor
  -1 siblings, 0 replies; 56+ messages in thread
From: Nathan Chancellor @ 2020-11-06 10:14 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: linux-arm-kernel, Nick Desaulniers, Arnd Bergmann,
	clang-built-linux, Russell King, linux-kernel, kernel,
	Ard Biesheuvel

+ Ard, who wrote this code.

On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> Due to a Clang bug [1] neon autoloop vectorization does not happen or
> happens badly with no gains and considering previous GCC experiences
> which generated unoptimized code which was worse than the default asm
> implementation, it is safer to default clang builds to the known good
> generic implementation.
> 
> The kernel currently supports a minimum Clang version of v10.0.1, see
> commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").
> 
> When the bug gets eventually fixed, this commit could be reverted or,
> if the minimum clang version bump takes a long time, a warning could
> be added for users to upgrade their compilers like was done for GCC.
> 
> [1] https://bugs.llvm.org/show_bug.cgi?id=40976
> 
> Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>

Thank you for the patch! We are also tracking this here:

https://github.com/ClangBuiltLinux/linux/issues/496

It was on my TODO to revist getting the warning eliminated, which likely
would have involved a patch like this as well.

I am curious if it is worth revisting or dusting off Arnd's patch in the
LLVM bug tracker first. I have not tried it personally. If that is not a
worthwhile option, I am fine with this for now. It would be nice to try
and get a fix pinned down on the LLVM side at some point but alas,
finite amount of resources and people :(

Should no other options come to fruition from further discussions, you
can carry my tag forward:

Acked-by: Nathan Chancellor <natechancellor@gmail.com>

Hopefully others can comment soon.

> ---
>  arch/arm/include/asm/xor.h | 3 ++-
>  arch/arm/lib/Makefile      | 3 +++
>  arch/arm/lib/xor-neon.c    | 4 ++++
>  3 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> index aefddec79286..49937dafaa71 100644
> --- a/arch/arm/include/asm/xor.h
> +++ b/arch/arm/include/asm/xor.h
> @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
>  		NEON_TEMPLATES;			\
>  	} while (0)
>  
> -#ifdef CONFIG_KERNEL_MODE_NEON
> +/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
> +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
>  
>  extern struct xor_block_template const xor_block_neon_inner;
>  
> diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> index 6d2ba454f25b..53f9e7dd9714 100644
> --- a/arch/arm/lib/Makefile
> +++ b/arch/arm/lib/Makefile
> @@ -43,8 +43,11 @@ endif
>  $(obj)/csumpartialcopy.o:	$(obj)/csumpartialcopygeneric.S
>  $(obj)/csumpartialcopyuser.o:	$(obj)/csumpartialcopygeneric.S
>  
> +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
> +ifndef CONFIG_CC_IS_CLANG
>  ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
>    NEON_FLAGS			:= -march=armv7-a -mfloat-abi=softfp -mfpu=neon
>    CFLAGS_xor-neon.o		+= $(NEON_FLAGS)
>    obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
>  endif
> +endif
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index e1e76186ec23..84c91c48dfa2 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
>   * Pull in the reference implementations while instructing GCC (through
>   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
>   * NEON instructions.
> +
> + * On Clang the loop vectorizer is enabled by default, but due to a bug
> + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> + * so xor-neon is disabled in favor of the default reg implementations.
>   */
>  #ifdef CONFIG_CC_IS_GCC
>  #pragma GCC optimize "tree-vectorize"
> -- 
> 2.29.0
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-06 10:14     ` Nathan Chancellor
  0 siblings, 0 replies; 56+ messages in thread
From: Nathan Chancellor @ 2020-11-06 10:14 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Arnd Bergmann, Nick Desaulniers, linux-kernel, Russell King,
	clang-built-linux, kernel, Ard Biesheuvel, linux-arm-kernel

+ Ard, who wrote this code.

On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> Due to a Clang bug [1] neon autoloop vectorization does not happen or
> happens badly with no gains and considering previous GCC experiences
> which generated unoptimized code which was worse than the default asm
> implementation, it is safer to default clang builds to the known good
> generic implementation.
> 
> The kernel currently supports a minimum Clang version of v10.0.1, see
> commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").
> 
> When the bug gets eventually fixed, this commit could be reverted or,
> if the minimum clang version bump takes a long time, a warning could
> be added for users to upgrade their compilers like was done for GCC.
> 
> [1] https://bugs.llvm.org/show_bug.cgi?id=40976
> 
> Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>

Thank you for the patch! We are also tracking this here:

https://github.com/ClangBuiltLinux/linux/issues/496

It was on my TODO to revist getting the warning eliminated, which likely
would have involved a patch like this as well.

I am curious if it is worth revisting or dusting off Arnd's patch in the
LLVM bug tracker first. I have not tried it personally. If that is not a
worthwhile option, I am fine with this for now. It would be nice to try
and get a fix pinned down on the LLVM side at some point but alas,
finite amount of resources and people :(

Should no other options come to fruition from further discussions, you
can carry my tag forward:

Acked-by: Nathan Chancellor <natechancellor@gmail.com>

Hopefully others can comment soon.

> ---
>  arch/arm/include/asm/xor.h | 3 ++-
>  arch/arm/lib/Makefile      | 3 +++
>  arch/arm/lib/xor-neon.c    | 4 ++++
>  3 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> index aefddec79286..49937dafaa71 100644
> --- a/arch/arm/include/asm/xor.h
> +++ b/arch/arm/include/asm/xor.h
> @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
>  		NEON_TEMPLATES;			\
>  	} while (0)
>  
> -#ifdef CONFIG_KERNEL_MODE_NEON
> +/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
> +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
>  
>  extern struct xor_block_template const xor_block_neon_inner;
>  
> diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> index 6d2ba454f25b..53f9e7dd9714 100644
> --- a/arch/arm/lib/Makefile
> +++ b/arch/arm/lib/Makefile
> @@ -43,8 +43,11 @@ endif
>  $(obj)/csumpartialcopy.o:	$(obj)/csumpartialcopygeneric.S
>  $(obj)/csumpartialcopyuser.o:	$(obj)/csumpartialcopygeneric.S
>  
> +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
> +ifndef CONFIG_CC_IS_CLANG
>  ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
>    NEON_FLAGS			:= -march=armv7-a -mfloat-abi=softfp -mfpu=neon
>    CFLAGS_xor-neon.o		+= $(NEON_FLAGS)
>    obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
>  endif
> +endif
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index e1e76186ec23..84c91c48dfa2 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
>   * Pull in the reference implementations while instructing GCC (through
>   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
>   * NEON instructions.
> +
> + * On Clang the loop vectorizer is enabled by default, but due to a bug
> + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> + * so xor-neon is disabled in favor of the default reg implementations.
>   */
>  #ifdef CONFIG_CC_IS_GCC
>  #pragma GCC optimize "tree-vectorize"
> -- 
> 2.29.0
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-06 10:14     ` Nathan Chancellor
@ 2020-11-06 11:50       ` Adrian Ratiu
  -1 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-06 11:50 UTC (permalink / raw)
  To: Nathan Chancellor, Arnd Bergmann
  Cc: linux-arm-kernel, Nick Desaulniers, clang-built-linux,
	Russell King, linux-kernel, kernel, Ard Biesheuvel

Hi Nathan,

On Fri, 06 Nov 2020, Nathan Chancellor <natechancellor@gmail.com> 
wrote:
> + Ard, who wrote this code. 
> 
> On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote: 
>> Due to a Clang bug [1] neon autoloop vectorization does not 
>> happen or happens badly with no gains and considering previous 
>> GCC experiences which generated unoptimized code which was 
>> worse than the default asm implementation, it is safer to 
>> default clang builds to the known good generic implementation. 
>> The kernel currently supports a minimum Clang version of 
>> v10.0.1, see commit 1f7a44f63e6c ("compiler-clang: add build 
>> check for clang 10.0.1").   When the bug gets eventually fixed, 
>> this commit could be reverted or, if the minimum clang version 
>> bump takes a long time, a warning could be added for users to 
>> upgrade their compilers like was done for GCC.   [1] 
>> https://bugs.llvm.org/show_bug.cgi?id=40976  Signed-off-by: 
>> Adrian Ratiu <adrian.ratiu@collabora.com> 
> 
> Thank you for the patch! We are also tracking this here: 
> 
> https://github.com/ClangBuiltLinux/linux/issues/496 
> 
> It was on my TODO to revist getting the warning eliminated, 
> which likely would have involved a patch like this as well. 
> 
> I am curious if it is worth revisting or dusting off Arnd's 
> patch in the LLVM bug tracker first. I have not tried it 
> personally. If that is not a worthwhile option, I am fine with 
> this for now. It would be nice to try and get a fix pinned down 
> on the LLVM side at some point but alas, finite amount of 
> resources and people :( 

I tested Arnd's kernel patch from the LLVM bugtracker [1], but 
with the Clang v10.0.1 I still get warnings like the following 
even though the __restrict workaround seems to affect the 
generated instructions:

./include/asm-generic/xor.h:15:2: remark: the cost-model indicates 
that interleaving is not beneficial [-Rpass-missed=loop-vectorize] 
./include/asm-generic/xor.h:11:1: remark: List vectorization was 
possible but not beneficial with cost 0 >= 0 
[-Rpass-missed=slp-vectorizer] xor_8regs_2(unsigned long bytes, 
unsigned long *__restrict p1, unsigned long *__restrict p2)

[1] https://bugs.llvm.org/show_bug.cgi?id=40976#c6

> 
> Should no other options come to fruition from further 
> discussions, you can carry my tag forward: 
> 
> Acked-by: Nathan Chancellor <natechancellor@gmail.com> 
> 
> Hopefully others can comment soon.

In my opinion we have 3 ways to go regarding this:

1. Leave it as is and try to notify the user of the breakage (eg 
add a new warning). You previously said this is not a good idea 
because the user can't do anything about it. I agree.

2. Somehow work around the compiler bug in the kernel which is 
what the LLVM bugtracker patch tries to do. This is a slippery 
slope even if we somehow get it right, especially since multiple 
Clang versions might be supported in the future and we don't know 
when the bug will be properly fixed by the compiler. In addition 
we're enabling and "hiding" possibly undefined behaviour.

3. Disable the broken feature and once the compiler bug is fixed 
enable it back warning users of old compilers that there is an 
action they can take: upgrade. This is exactly how this was 
handled for GCC previously, so there is a precedent.

This implements the 3'rd scenario which is also the first thing 
Arnd suggested in the original patch. :)

Adrian

>
>> ---
>>  arch/arm/include/asm/xor.h | 3 ++-
>>  arch/arm/lib/Makefile      | 3 +++
>>  arch/arm/lib/xor-neon.c    | 4 ++++
>>  3 files changed, 9 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
>> index aefddec79286..49937dafaa71 100644
>> --- a/arch/arm/include/asm/xor.h
>> +++ b/arch/arm/include/asm/xor.h
>> @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
>>  		NEON_TEMPLATES;			\
>>  	} while (0)
>>  
>> -#ifdef CONFIG_KERNEL_MODE_NEON
>> +/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
>> +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
>>  
>>  extern struct xor_block_template const xor_block_neon_inner;
>>  
>> diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
>> index 6d2ba454f25b..53f9e7dd9714 100644
>> --- a/arch/arm/lib/Makefile
>> +++ b/arch/arm/lib/Makefile
>> @@ -43,8 +43,11 @@ endif
>>  $(obj)/csumpartialcopy.o:	$(obj)/csumpartialcopygeneric.S
>>  $(obj)/csumpartialcopyuser.o:	$(obj)/csumpartialcopygeneric.S
>>  
>> +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
>> +ifndef CONFIG_CC_IS_CLANG
>>  ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
>>    NEON_FLAGS			:= -march=armv7-a -mfloat-abi=softfp -mfpu=neon
>>    CFLAGS_xor-neon.o		+= $(NEON_FLAGS)
>>    obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
>>  endif
>> +endif
>> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
>> index e1e76186ec23..84c91c48dfa2 100644
>> --- a/arch/arm/lib/xor-neon.c
>> +++ b/arch/arm/lib/xor-neon.c
>> @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
>>   * Pull in the reference implementations while instructing GCC (through
>>   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
>>   * NEON instructions.
>> +
>> + * On Clang the loop vectorizer is enabled by default, but due to a bug
>> + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
>> + * so xor-neon is disabled in favor of the default reg implementations.
>>   */
>>  #ifdef CONFIG_CC_IS_GCC
>>  #pragma GCC optimize "tree-vectorize"
>> -- 
>> 2.29.0
>> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-06 11:50       ` Adrian Ratiu
  0 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-06 11:50 UTC (permalink / raw)
  To: Nathan Chancellor, Arnd Bergmann
  Cc: Nick Desaulniers, Russell King, linux-kernel, clang-built-linux,
	kernel, Ard Biesheuvel, linux-arm-kernel

Hi Nathan,

On Fri, 06 Nov 2020, Nathan Chancellor <natechancellor@gmail.com> 
wrote:
> + Ard, who wrote this code. 
> 
> On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote: 
>> Due to a Clang bug [1] neon autoloop vectorization does not 
>> happen or happens badly with no gains and considering previous 
>> GCC experiences which generated unoptimized code which was 
>> worse than the default asm implementation, it is safer to 
>> default clang builds to the known good generic implementation. 
>> The kernel currently supports a minimum Clang version of 
>> v10.0.1, see commit 1f7a44f63e6c ("compiler-clang: add build 
>> check for clang 10.0.1").   When the bug gets eventually fixed, 
>> this commit could be reverted or, if the minimum clang version 
>> bump takes a long time, a warning could be added for users to 
>> upgrade their compilers like was done for GCC.   [1] 
>> https://bugs.llvm.org/show_bug.cgi?id=40976  Signed-off-by: 
>> Adrian Ratiu <adrian.ratiu@collabora.com> 
> 
> Thank you for the patch! We are also tracking this here: 
> 
> https://github.com/ClangBuiltLinux/linux/issues/496 
> 
> It was on my TODO to revist getting the warning eliminated, 
> which likely would have involved a patch like this as well. 
> 
> I am curious if it is worth revisting or dusting off Arnd's 
> patch in the LLVM bug tracker first. I have not tried it 
> personally. If that is not a worthwhile option, I am fine with 
> this for now. It would be nice to try and get a fix pinned down 
> on the LLVM side at some point but alas, finite amount of 
> resources and people :( 

I tested Arnd's kernel patch from the LLVM bugtracker [1], but 
with the Clang v10.0.1 I still get warnings like the following 
even though the __restrict workaround seems to affect the 
generated instructions:

./include/asm-generic/xor.h:15:2: remark: the cost-model indicates 
that interleaving is not beneficial [-Rpass-missed=loop-vectorize] 
./include/asm-generic/xor.h:11:1: remark: List vectorization was 
possible but not beneficial with cost 0 >= 0 
[-Rpass-missed=slp-vectorizer] xor_8regs_2(unsigned long bytes, 
unsigned long *__restrict p1, unsigned long *__restrict p2)

[1] https://bugs.llvm.org/show_bug.cgi?id=40976#c6

> 
> Should no other options come to fruition from further 
> discussions, you can carry my tag forward: 
> 
> Acked-by: Nathan Chancellor <natechancellor@gmail.com> 
> 
> Hopefully others can comment soon.

In my opinion we have 3 ways to go regarding this:

1. Leave it as is and try to notify the user of the breakage (eg 
add a new warning). You previously said this is not a good idea 
because the user can't do anything about it. I agree.

2. Somehow work around the compiler bug in the kernel which is 
what the LLVM bugtracker patch tries to do. This is a slippery 
slope even if we somehow get it right, especially since multiple 
Clang versions might be supported in the future and we don't know 
when the bug will be properly fixed by the compiler. In addition 
we're enabling and "hiding" possibly undefined behaviour.

3. Disable the broken feature and once the compiler bug is fixed 
enable it back warning users of old compilers that there is an 
action they can take: upgrade. This is exactly how this was 
handled for GCC previously, so there is a precedent.

This implements the 3'rd scenario which is also the first thing 
Arnd suggested in the original patch. :)

Adrian

>
>> ---
>>  arch/arm/include/asm/xor.h | 3 ++-
>>  arch/arm/lib/Makefile      | 3 +++
>>  arch/arm/lib/xor-neon.c    | 4 ++++
>>  3 files changed, 9 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
>> index aefddec79286..49937dafaa71 100644
>> --- a/arch/arm/include/asm/xor.h
>> +++ b/arch/arm/include/asm/xor.h
>> @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
>>  		NEON_TEMPLATES;			\
>>  	} while (0)
>>  
>> -#ifdef CONFIG_KERNEL_MODE_NEON
>> +/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
>> +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
>>  
>>  extern struct xor_block_template const xor_block_neon_inner;
>>  
>> diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
>> index 6d2ba454f25b..53f9e7dd9714 100644
>> --- a/arch/arm/lib/Makefile
>> +++ b/arch/arm/lib/Makefile
>> @@ -43,8 +43,11 @@ endif
>>  $(obj)/csumpartialcopy.o:	$(obj)/csumpartialcopygeneric.S
>>  $(obj)/csumpartialcopyuser.o:	$(obj)/csumpartialcopygeneric.S
>>  
>> +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
>> +ifndef CONFIG_CC_IS_CLANG
>>  ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
>>    NEON_FLAGS			:= -march=armv7-a -mfloat-abi=softfp -mfpu=neon
>>    CFLAGS_xor-neon.o		+= $(NEON_FLAGS)
>>    obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
>>  endif
>> +endif
>> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
>> index e1e76186ec23..84c91c48dfa2 100644
>> --- a/arch/arm/lib/xor-neon.c
>> +++ b/arch/arm/lib/xor-neon.c
>> @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
>>   * Pull in the reference implementations while instructing GCC (through
>>   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
>>   * NEON instructions.
>> +
>> + * On Clang the loop vectorizer is enabled by default, but due to a bug
>> + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
>> + * so xor-neon is disabled in favor of the default reg implementations.
>>   */
>>  #ifdef CONFIG_CC_IS_GCC
>>  #pragma GCC optimize "tree-vectorize"
>> -- 
>> 2.29.0
>> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning
  2020-11-06  5:14   ` Adrian Ratiu
@ 2020-11-06 14:46     ` Arnd Bergmann
  -1 siblings, 0 replies; 56+ messages in thread
From: Arnd Bergmann @ 2020-11-06 14:46 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Linux ARM, Nathan Chancellor, Nick Desaulniers, Arnd Bergmann,
	clang-built-linux, Russell King, linux-kernel,
	Collabora kernel ML

On Fri, Nov 6, 2020 at 6:14 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
>
> From: Nathan Chancellor <natechancellor@gmail.com>
>
> Drop warning because kernel now requires GCC >= v4.9 after
> commit 6ec4476ac825 ("Raise gcc version requirement to 4.9").
>
> Reported-by: Nick Desaulniers <ndesaulniers@google.com>
> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
> Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> ---
>  arch/arm/lib/xor-neon.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index b99dd8e1c93f..e1e76186ec23 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -19,15 +19,8 @@ MODULE_LICENSE("GPL");
>   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
>   * NEON instructions.
>   */
> -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> +#ifdef CONFIG_CC_IS_GCC
>  #pragma GCC optimize "tree-vectorize"
> -#else
> -/*
> - * While older versions of GCC do not generate incorrect code, they fail to
> - * recognize the parallel nature of these functions, and emit plain ARM code,
> - * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
> - */
> -#warning This code requires at least version 4.6 of GCC
>  #endif

I think we still need the #else path, otherwise we don't warn about
clang being broken here.

If it is intentional that we now silently build this code with clang
without it working as intended, that should be mentioned in the
changelog.

      Arnd

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning
@ 2020-11-06 14:46     ` Arnd Bergmann
  0 siblings, 0 replies; 56+ messages in thread
From: Arnd Bergmann @ 2020-11-06 14:46 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Arnd Bergmann, Nick Desaulniers, Russell King, linux-kernel,
	clang-built-linux, Nathan Chancellor, Collabora kernel ML,
	Linux ARM

On Fri, Nov 6, 2020 at 6:14 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
>
> From: Nathan Chancellor <natechancellor@gmail.com>
>
> Drop warning because kernel now requires GCC >= v4.9 after
> commit 6ec4476ac825 ("Raise gcc version requirement to 4.9").
>
> Reported-by: Nick Desaulniers <ndesaulniers@google.com>
> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
> Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> ---
>  arch/arm/lib/xor-neon.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index b99dd8e1c93f..e1e76186ec23 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -19,15 +19,8 @@ MODULE_LICENSE("GPL");
>   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
>   * NEON instructions.
>   */
> -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> +#ifdef CONFIG_CC_IS_GCC
>  #pragma GCC optimize "tree-vectorize"
> -#else
> -/*
> - * While older versions of GCC do not generate incorrect code, they fail to
> - * recognize the parallel nature of these functions, and emit plain ARM code,
> - * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
> - */
> -#warning This code requires at least version 4.6 of GCC
>  #endif

I think we still need the #else path, otherwise we don't warn about
clang being broken here.

If it is intentional that we now silently build this code with clang
without it working as intended, that should be mentioned in the
changelog.

      Arnd

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-06 11:50       ` Adrian Ratiu
@ 2020-11-06 18:01         ` Nathan Chancellor
  -1 siblings, 0 replies; 56+ messages in thread
From: Nathan Chancellor @ 2020-11-06 18:01 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Arnd Bergmann, linux-arm-kernel, Nick Desaulniers,
	clang-built-linux, Russell King, linux-kernel, kernel,
	Ard Biesheuvel

Hi Adrian,

On Fri, Nov 06, 2020 at 01:50:13PM +0200, Adrian Ratiu wrote:
> I tested Arnd's kernel patch from the LLVM bugtracker [1], but with the
> Clang v10.0.1 I still get warnings like the following even though the
> __restrict workaround seems to affect the generated instructions:
> 
> ./include/asm-generic/xor.h:15:2: remark: the cost-model indicates that
> interleaving is not beneficial [-Rpass-missed=loop-vectorize]
> ./include/asm-generic/xor.h:11:1: remark: List vectorization was possible
> but not beneficial with cost 0 >= 0 [-Rpass-missed=slp-vectorizer]
> xor_8regs_2(unsigned long bytes, unsigned long *__restrict p1, unsigned long
> *__restrict p2)
> 
> [1] https://bugs.llvm.org/show_bug.cgi?id=40976#c6

Ack, thanks for double checking!

> In my opinion we have 3 ways to go regarding this:
> 
> 1. Leave it as is and try to notify the user of the breakage (eg add a new
> warning). You previously said this is not a good idea because the user can't
> do anything about it. I agree.
> 
> 2. Somehow work around the compiler bug in the kernel which is what the LLVM
> bugtracker patch tries to do. This is a slippery slope even if we somehow
> get it right, especially since multiple Clang versions might be supported in
> the future and we don't know when the bug will be properly fixed by the
> compiler. In addition we're enabling and "hiding" possibly undefined
> behaviour.
> 
> 3. Disable the broken feature and once the compiler bug is fixed enable it
> back warning users of old compilers that there is an action they can take:
> upgrade. This is exactly how this was handled for GCC previously, so there
> is a precedent.
> 
> This implements the 3'rd scenario which is also the first thing Arnd
> suggested in the original patch. :)

I agree that number three is definitely the most robust against the
future. I know that I periodically grep the tree for "bugs.llvm.org"
because we always file something on LLVM's bug tracker when we have to
work around something in the kernel. I think this patch is totally fine
as is, hopefully we can get it fixed in LLVM sooner rather than later.

Cheers,
Nathan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-06 18:01         ` Nathan Chancellor
  0 siblings, 0 replies; 56+ messages in thread
From: Nathan Chancellor @ 2020-11-06 18:01 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Arnd Bergmann, Nick Desaulniers, linux-kernel, Russell King,
	clang-built-linux, kernel, Ard Biesheuvel, linux-arm-kernel

Hi Adrian,

On Fri, Nov 06, 2020 at 01:50:13PM +0200, Adrian Ratiu wrote:
> I tested Arnd's kernel patch from the LLVM bugtracker [1], but with the
> Clang v10.0.1 I still get warnings like the following even though the
> __restrict workaround seems to affect the generated instructions:
> 
> ./include/asm-generic/xor.h:15:2: remark: the cost-model indicates that
> interleaving is not beneficial [-Rpass-missed=loop-vectorize]
> ./include/asm-generic/xor.h:11:1: remark: List vectorization was possible
> but not beneficial with cost 0 >= 0 [-Rpass-missed=slp-vectorizer]
> xor_8regs_2(unsigned long bytes, unsigned long *__restrict p1, unsigned long
> *__restrict p2)
> 
> [1] https://bugs.llvm.org/show_bug.cgi?id=40976#c6

Ack, thanks for double checking!

> In my opinion we have 3 ways to go regarding this:
> 
> 1. Leave it as is and try to notify the user of the breakage (eg add a new
> warning). You previously said this is not a good idea because the user can't
> do anything about it. I agree.
> 
> 2. Somehow work around the compiler bug in the kernel which is what the LLVM
> bugtracker patch tries to do. This is a slippery slope even if we somehow
> get it right, especially since multiple Clang versions might be supported in
> the future and we don't know when the bug will be properly fixed by the
> compiler. In addition we're enabling and "hiding" possibly undefined
> behaviour.
> 
> 3. Disable the broken feature and once the compiler bug is fixed enable it
> back warning users of old compilers that there is an action they can take:
> upgrade. This is exactly how this was handled for GCC previously, so there
> is a precedent.
> 
> This implements the 3'rd scenario which is also the first thing Arnd
> suggested in the original patch. :)

I agree that number three is definitely the most robust against the
future. I know that I periodically grep the tree for "bugs.llvm.org"
because we always file something on LLVM's bug tracker when we have to
work around something in the kernel. I think this patch is totally fine
as is, hopefully we can get it fixed in LLVM sooner rather than later.

Cheers,
Nathan

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning
  2020-11-06 14:46     ` Arnd Bergmann
@ 2020-11-06 18:03       ` Nathan Chancellor
  -1 siblings, 0 replies; 56+ messages in thread
From: Nathan Chancellor @ 2020-11-06 18:03 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Adrian Ratiu, Linux ARM, Nick Desaulniers, Arnd Bergmann,
	clang-built-linux, Russell King, linux-kernel,
	Collabora kernel ML

On Fri, Nov 06, 2020 at 03:46:36PM +0100, Arnd Bergmann wrote:
> On Fri, Nov 6, 2020 at 6:14 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
> >
> > From: Nathan Chancellor <natechancellor@gmail.com>
> >
> > Drop warning because kernel now requires GCC >= v4.9 after
> > commit 6ec4476ac825 ("Raise gcc version requirement to 4.9").
> >
> > Reported-by: Nick Desaulniers <ndesaulniers@google.com>
> > Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
> > Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> > ---
> >  arch/arm/lib/xor-neon.c | 9 +--------
> >  1 file changed, 1 insertion(+), 8 deletions(-)
> >
> > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > index b99dd8e1c93f..e1e76186ec23 100644
> > --- a/arch/arm/lib/xor-neon.c
> > +++ b/arch/arm/lib/xor-neon.c
> > @@ -19,15 +19,8 @@ MODULE_LICENSE("GPL");
> >   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> >   * NEON instructions.
> >   */
> > -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> > +#ifdef CONFIG_CC_IS_GCC
> >  #pragma GCC optimize "tree-vectorize"
> > -#else
> > -/*
> > - * While older versions of GCC do not generate incorrect code, they fail to
> > - * recognize the parallel nature of these functions, and emit plain ARM code,
> > - * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
> > - */
> > -#warning This code requires at least version 4.6 of GCC
> >  #endif
> 
> I think we still need the #else path, otherwise we don't warn about
> clang being broken here.
> 
> If it is intentional that we now silently build this code with clang
> without it working as intended, that should be mentioned in the
> changelog.
> 
>       Arnd

Maybe patch 2/2 should come before this one? With that patch, we are not
even going to build this code so this patch purely becomes a "we do not
support this GCC version anymore" cleanup patch.

Cheers,
Nathan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning
@ 2020-11-06 18:03       ` Nathan Chancellor
  0 siblings, 0 replies; 56+ messages in thread
From: Nathan Chancellor @ 2020-11-06 18:03 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Arnd Bergmann, Adrian Ratiu, Nick Desaulniers, Russell King,
	linux-kernel, clang-built-linux, Collabora kernel ML, Linux ARM

On Fri, Nov 06, 2020 at 03:46:36PM +0100, Arnd Bergmann wrote:
> On Fri, Nov 6, 2020 at 6:14 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
> >
> > From: Nathan Chancellor <natechancellor@gmail.com>
> >
> > Drop warning because kernel now requires GCC >= v4.9 after
> > commit 6ec4476ac825 ("Raise gcc version requirement to 4.9").
> >
> > Reported-by: Nick Desaulniers <ndesaulniers@google.com>
> > Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
> > Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> > ---
> >  arch/arm/lib/xor-neon.c | 9 +--------
> >  1 file changed, 1 insertion(+), 8 deletions(-)
> >
> > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > index b99dd8e1c93f..e1e76186ec23 100644
> > --- a/arch/arm/lib/xor-neon.c
> > +++ b/arch/arm/lib/xor-neon.c
> > @@ -19,15 +19,8 @@ MODULE_LICENSE("GPL");
> >   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> >   * NEON instructions.
> >   */
> > -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> > +#ifdef CONFIG_CC_IS_GCC
> >  #pragma GCC optimize "tree-vectorize"
> > -#else
> > -/*
> > - * While older versions of GCC do not generate incorrect code, they fail to
> > - * recognize the parallel nature of these functions, and emit plain ARM code,
> > - * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
> > - */
> > -#warning This code requires at least version 4.6 of GCC
> >  #endif
> 
> I think we still need the #else path, otherwise we don't warn about
> clang being broken here.
> 
> If it is intentional that we now silently build this code with clang
> without it working as intended, that should be mentioned in the
> changelog.
> 
>       Arnd

Maybe patch 2/2 should come before this one? With that patch, we are not
even going to build this code so this patch purely becomes a "we do not
support this GCC version anymore" cleanup patch.

Cheers,
Nathan

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-06 11:50       ` Adrian Ratiu
@ 2020-11-06 19:52         ` Nick Desaulniers
  -1 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-06 19:52 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Nathan Chancellor, Arnd Bergmann, Linux ARM, clang-built-linux,
	Russell King, LKML, Collabora Kernel ML, Ard Biesheuvel

On Fri, Nov 6, 2020 at 3:50 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
>
> Hi Nathan,
>
> On Fri, 06 Nov 2020, Nathan Chancellor <natechancellor@gmail.com>
> wrote:
> > + Ard, who wrote this code.
> >
> > On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> >> Due to a Clang bug [1] neon autoloop vectorization does not
> >> happen or happens badly with no gains and considering previous
> >> GCC experiences which generated unoptimized code which was
> >> worse than the default asm implementation, it is safer to
> >> default clang builds to the known good generic implementation.
> >> The kernel currently supports a minimum Clang version of
> >> v10.0.1, see commit 1f7a44f63e6c ("compiler-clang: add build
> >> check for clang 10.0.1").   When the bug gets eventually fixed,
> >> this commit could be reverted or, if the minimum clang version
> >> bump takes a long time, a warning could be added for users to
> >> upgrade their compilers like was done for GCC.   [1]
> >> https://bugs.llvm.org/show_bug.cgi?id=40976  Signed-off-by:
> >> Adrian Ratiu <adrian.ratiu@collabora.com>
> >
> > Thank you for the patch! We are also tracking this here:
> >
> > https://github.com/ClangBuiltLinux/linux/issues/496
> >
> > It was on my TODO to revist getting the warning eliminated,
> > which likely would have involved a patch like this as well.
> >
> > I am curious if it is worth revisting or dusting off Arnd's
> > patch in the LLVM bug tracker first. I have not tried it
> > personally. If that is not a worthwhile option, I am fine with
> > this for now. It would be nice to try and get a fix pinned down
> > on the LLVM side at some point but alas, finite amount of
> > resources and people :(
>
> I tested Arnd's kernel patch from the LLVM bugtracker [1], but
> with the Clang v10.0.1 I still get warnings like the following
> even though the __restrict workaround seems to affect the
> generated instructions:
>
> ./include/asm-generic/xor.h:15:2: remark: the cost-model indicates
> that interleaving is not beneficial [-Rpass-missed=loop-vectorize]
> ./include/asm-generic/xor.h:11:1: remark: List vectorization was
> possible but not beneficial with cost 0 >= 0
> [-Rpass-missed=slp-vectorizer] xor_8regs_2(unsigned long bytes,
> unsigned long *__restrict p1, unsigned long *__restrict p2)

If it's just a matter of overruling the cost model
#pragma clang loop vectorize(enable)

will do the trick.

Indeed,
```
diff --git a/include/asm-generic/xor.h b/include/asm-generic/xor.h
index b62a2a56a4d4..8796955498b7 100644
--- a/include/asm-generic/xor.h
+++ b/include/asm-generic/xor.h
@@ -12,6 +12,7 @@ xor_8regs_2(unsigned long bytes, unsigned long *p1,
unsigned long *p2)
 {
        long lines = bytes / (sizeof (long)) / 8;

+#pragma clang loop vectorize(enable)
        do {
                p1[0] ^= p2[0];
                p1[1] ^= p2[1];
@@ -32,6 +33,7 @@ xor_8regs_3(unsigned long bytes, unsigned long *p1,
unsigned long *p2,
 {
        long lines = bytes / (sizeof (long)) / 8;

+#pragma clang loop vectorize(enable)
        do {
                p1[0] ^= p2[0] ^ p3[0];
                p1[1] ^= p2[1] ^ p3[1];
@@ -53,6 +55,7 @@ xor_8regs_4(unsigned long bytes, unsigned long *p1,
unsigned long *p2,
 {
        long lines = bytes / (sizeof (long)) / 8;

+#pragma clang loop vectorize(enable)
        do {
                p1[0] ^= p2[0] ^ p3[0] ^ p4[0];
                p1[1] ^= p2[1] ^ p3[1] ^ p4[1];
@@ -75,6 +78,7 @@ xor_8regs_5(unsigned long bytes, unsigned long *p1,
unsigned long *p2,
 {
        long lines = bytes / (sizeof (long)) / 8;

+#pragma clang loop vectorize(enable)
        do {
                p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0];
                p1[1] ^= p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
```
seems to generate the vectorized code.

Why don't we find a way to make those pragma's more toolchain
portable, rather than open coding them like I have above rather than
this series?

-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-06 19:52         ` Nick Desaulniers
  0 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-06 19:52 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Arnd Bergmann, LKML, Russell King, clang-built-linux,
	Nathan Chancellor, Collabora Kernel ML, Ard Biesheuvel,
	Linux ARM

On Fri, Nov 6, 2020 at 3:50 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
>
> Hi Nathan,
>
> On Fri, 06 Nov 2020, Nathan Chancellor <natechancellor@gmail.com>
> wrote:
> > + Ard, who wrote this code.
> >
> > On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> >> Due to a Clang bug [1] neon autoloop vectorization does not
> >> happen or happens badly with no gains and considering previous
> >> GCC experiences which generated unoptimized code which was
> >> worse than the default asm implementation, it is safer to
> >> default clang builds to the known good generic implementation.
> >> The kernel currently supports a minimum Clang version of
> >> v10.0.1, see commit 1f7a44f63e6c ("compiler-clang: add build
> >> check for clang 10.0.1").   When the bug gets eventually fixed,
> >> this commit could be reverted or, if the minimum clang version
> >> bump takes a long time, a warning could be added for users to
> >> upgrade their compilers like was done for GCC.   [1]
> >> https://bugs.llvm.org/show_bug.cgi?id=40976  Signed-off-by:
> >> Adrian Ratiu <adrian.ratiu@collabora.com>
> >
> > Thank you for the patch! We are also tracking this here:
> >
> > https://github.com/ClangBuiltLinux/linux/issues/496
> >
> > It was on my TODO to revist getting the warning eliminated,
> > which likely would have involved a patch like this as well.
> >
> > I am curious if it is worth revisting or dusting off Arnd's
> > patch in the LLVM bug tracker first. I have not tried it
> > personally. If that is not a worthwhile option, I am fine with
> > this for now. It would be nice to try and get a fix pinned down
> > on the LLVM side at some point but alas, finite amount of
> > resources and people :(
>
> I tested Arnd's kernel patch from the LLVM bugtracker [1], but
> with the Clang v10.0.1 I still get warnings like the following
> even though the __restrict workaround seems to affect the
> generated instructions:
>
> ./include/asm-generic/xor.h:15:2: remark: the cost-model indicates
> that interleaving is not beneficial [-Rpass-missed=loop-vectorize]
> ./include/asm-generic/xor.h:11:1: remark: List vectorization was
> possible but not beneficial with cost 0 >= 0
> [-Rpass-missed=slp-vectorizer] xor_8regs_2(unsigned long bytes,
> unsigned long *__restrict p1, unsigned long *__restrict p2)

If it's just a matter of overruling the cost model
#pragma clang loop vectorize(enable)

will do the trick.

Indeed,
```
diff --git a/include/asm-generic/xor.h b/include/asm-generic/xor.h
index b62a2a56a4d4..8796955498b7 100644
--- a/include/asm-generic/xor.h
+++ b/include/asm-generic/xor.h
@@ -12,6 +12,7 @@ xor_8regs_2(unsigned long bytes, unsigned long *p1,
unsigned long *p2)
 {
        long lines = bytes / (sizeof (long)) / 8;

+#pragma clang loop vectorize(enable)
        do {
                p1[0] ^= p2[0];
                p1[1] ^= p2[1];
@@ -32,6 +33,7 @@ xor_8regs_3(unsigned long bytes, unsigned long *p1,
unsigned long *p2,
 {
        long lines = bytes / (sizeof (long)) / 8;

+#pragma clang loop vectorize(enable)
        do {
                p1[0] ^= p2[0] ^ p3[0];
                p1[1] ^= p2[1] ^ p3[1];
@@ -53,6 +55,7 @@ xor_8regs_4(unsigned long bytes, unsigned long *p1,
unsigned long *p2,
 {
        long lines = bytes / (sizeof (long)) / 8;

+#pragma clang loop vectorize(enable)
        do {
                p1[0] ^= p2[0] ^ p3[0] ^ p4[0];
                p1[1] ^= p2[1] ^ p3[1] ^ p4[1];
@@ -75,6 +78,7 @@ xor_8regs_5(unsigned long bytes, unsigned long *p1,
unsigned long *p2,
 {
        long lines = bytes / (sizeof (long)) / 8;

+#pragma clang loop vectorize(enable)
        do {
                p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0];
                p1[1] ^= p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
```
seems to generate the vectorized code.

Why don't we find a way to make those pragma's more toolchain
portable, rather than open coding them like I have above rather than
this series?

-- 
Thanks,
~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning
  2020-11-06 18:03       ` Nathan Chancellor
@ 2020-11-06 21:46         ` Arnd Bergmann
  -1 siblings, 0 replies; 56+ messages in thread
From: Arnd Bergmann @ 2020-11-06 21:46 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Adrian Ratiu, Linux ARM, Nick Desaulniers, Arnd Bergmann,
	clang-built-linux, Russell King, linux-kernel,
	Collabora kernel ML

On Fri, Nov 6, 2020 at 7:03 PM Nathan Chancellor
<natechancellor@gmail.com> wrote:
> On Fri, Nov 06, 2020 at 03:46:36PM +0100, Arnd Bergmann wrote:
> >
> > If it is intentional that we now silently build this code with clang
> > without it working as intended, that should be mentioned in the
> > changelog.
>
> Maybe patch 2/2 should come before this one? With that patch, we are not
> even going to build this code so this patch purely becomes a "we do not
> support this GCC version anymore" cleanup patch.

No, I just need to read properly, I somehow missed the fact that there was
a second patch in the series.

      Arnd

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning
@ 2020-11-06 21:46         ` Arnd Bergmann
  0 siblings, 0 replies; 56+ messages in thread
From: Arnd Bergmann @ 2020-11-06 21:46 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Arnd Bergmann, Adrian Ratiu, Nick Desaulniers, Russell King,
	linux-kernel, clang-built-linux, Collabora kernel ML, Linux ARM

On Fri, Nov 6, 2020 at 7:03 PM Nathan Chancellor
<natechancellor@gmail.com> wrote:
> On Fri, Nov 06, 2020 at 03:46:36PM +0100, Arnd Bergmann wrote:
> >
> > If it is intentional that we now silently build this code with clang
> > without it working as intended, that should be mentioned in the
> > changelog.
>
> Maybe patch 2/2 should come before this one? With that patch, we are not
> even going to build this code so this patch purely becomes a "we do not
> support this GCC version anymore" cleanup patch.

No, I just need to read properly, I somehow missed the fact that there was
a second patch in the series.

      Arnd

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-06  5:14   ` Adrian Ratiu
@ 2020-11-07 10:22     ` Russell King - ARM Linux admin
  -1 siblings, 0 replies; 56+ messages in thread
From: Russell King - ARM Linux admin @ 2020-11-07 10:22 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: linux-arm-kernel, Nathan Chancellor, Nick Desaulniers,
	Arnd Bergmann, clang-built-linux, linux-kernel, kernel

On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index e1e76186ec23..84c91c48dfa2 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
>   * Pull in the reference implementations while instructing GCC (through
>   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
>   * NEON instructions.
> +

Please tidy this up before submission; we normally continue the "*" for
blank lines in comment blocks. Thanks.

> + * On Clang the loop vectorizer is enabled by default, but due to a bug
> + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> + * so xor-neon is disabled in favor of the default reg implementations.
>   */
>  #ifdef CONFIG_CC_IS_GCC
>  #pragma GCC optimize "tree-vectorize"
> -- 
> 2.29.0
> 
> 

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-07 10:22     ` Russell King - ARM Linux admin
  0 siblings, 0 replies; 56+ messages in thread
From: Russell King - ARM Linux admin @ 2020-11-07 10:22 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Arnd Bergmann, Nick Desaulniers, linux-kernel, clang-built-linux,
	Nathan Chancellor, kernel, linux-arm-kernel

On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index e1e76186ec23..84c91c48dfa2 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
>   * Pull in the reference implementations while instructing GCC (through
>   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
>   * NEON instructions.
> +

Please tidy this up before submission; we normally continue the "*" for
blank lines in comment blocks. Thanks.

> + * On Clang the loop vectorizer is enabled by default, but due to a bug
> + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> + * so xor-neon is disabled in favor of the default reg implementations.
>   */
>  #ifdef CONFIG_CC_IS_GCC
>  #pragma GCC optimize "tree-vectorize"
> -- 
> 2.29.0
> 
> 

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-06 19:52         ` Nick Desaulniers
@ 2020-11-07 18:07           ` Adrian Ratiu
  -1 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-07 18:07 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Nathan Chancellor, Arnd Bergmann, Linux ARM, clang-built-linux,
	Russell King, LKML, Collabora Kernel ML, Ard Biesheuvel

On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
wrote:
> On Fri, Nov 6, 2020 at 3:50 AM Adrian Ratiu 
> <adrian.ratiu@collabora.com> wrote: 
>> 
>> Hi Nathan, 
>> 
>> On Fri, 06 Nov 2020, Nathan Chancellor 
>> <natechancellor@gmail.com> wrote: 
>> > + Ard, who wrote this code. 
>> > 
>> > On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote: 
>> >> Due to a Clang bug [1] neon autoloop vectorization does not 
>> >> happen or happens badly with no gains and considering 
>> >> previous GCC experiences which generated unoptimized code 
>> >> which was worse than the default asm implementation, it is 
>> >> safer to default clang builds to the known good generic 
>> >> implementation.  The kernel currently supports a minimum 
>> >> Clang version of v10.0.1, see commit 1f7a44f63e6c 
>> >> ("compiler-clang: add build check for clang 10.0.1").   When 
>> >> the bug gets eventually fixed, this commit could be reverted 
>> >> or, if the minimum clang version bump takes a long time, a 
>> >> warning could be added for users to upgrade their compilers 
>> >> like was done for GCC.   [1] 
>> >> https://bugs.llvm.org/show_bug.cgi?id=40976  Signed-off-by: 
>> >> Adrian Ratiu <adrian.ratiu@collabora.com> 
>> > 
>> > Thank you for the patch! We are also tracking this here: 
>> > 
>> > https://github.com/ClangBuiltLinux/linux/issues/496 
>> > 
>> > It was on my TODO to revist getting the warning eliminated, 
>> > which likely would have involved a patch like this as well. 
>> > 
>> > I am curious if it is worth revisting or dusting off Arnd's 
>> > patch in the LLVM bug tracker first. I have not tried it 
>> > personally. If that is not a worthwhile option, I am fine 
>> > with this for now. It would be nice to try and get a fix 
>> > pinned down on the LLVM side at some point but alas, finite 
>> > amount of resources and people :( 
>> 
>> I tested Arnd's kernel patch from the LLVM bugtracker [1], but 
>> with the Clang v10.0.1 I still get warnings like the following 
>> even though the __restrict workaround seems to affect the 
>> generated instructions: 
>> 
>> ./include/asm-generic/xor.h:15:2: remark: the cost-model 
>> indicates that interleaving is not beneficial 
>> [-Rpass-missed=loop-vectorize] 
>> ./include/asm-generic/xor.h:11:1: remark: List vectorization 
>> was possible but not beneficial with cost 0 >= 0 
>> [-Rpass-missed=slp-vectorizer] xor_8regs_2(unsigned long bytes, 
>> unsigned long *__restrict p1, unsigned long *__restrict p2) 
> 
> If it's just a matter of overruling the cost model #pragma clang 
> loop vectorize(enable) 
> 
> will do the trick. 
> 
> Indeed, ``` diff --git a/include/asm-generic/xor.h 
> b/include/asm-generic/xor.h index b62a2a56a4d4..8796955498b7 
> 100644 --- a/include/asm-generic/xor.h +++ 
> b/include/asm-generic/xor.h @@ -12,6 +12,7 @@ 
> xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned 
> long *p2) 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0]; p1[1] ^= p2[1]; 
> @@ -32,6 +33,7 @@ xor_8regs_3(unsigned long bytes, unsigned long 
> *p1, unsigned long *p2, 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0] ^ p3[0]; p1[1] ^= p2[1] ^ p3[1]; 
> @@ -53,6 +55,7 @@ xor_8regs_4(unsigned long bytes, unsigned long 
> *p1, unsigned long *p2, 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0]; p1[1] ^= p2[1] ^ 
>                 p3[1] ^ p4[1]; 
> @@ -75,6 +78,7 @@ xor_8regs_5(unsigned long bytes, unsigned long 
> *p1, unsigned long *p2, 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^= 
>                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1]; 
> ``` seems to generate the vectorized code. 
> 
> Why don't we find a way to make those pragma's more toolchain 
> portable, rather than open coding them like I have above rather 
> than this series?

Hi Nick,

Thank you very much for the suggestion.

I agree. If a toolchain portable way can be found to realiably 
trigger the optimization, I will gladly replace this patch. :)

Will work on it starting Monday then report back my findings or, 
if I can get it to work in a satisfying manner, send a v2 series 
directly.

The first patch is still needed because it's more of a general 
cleanup as Nathan correctly observed.

Regards,
Adrian

>
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-07 18:07           ` Adrian Ratiu
  0 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-07 18:07 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Arnd Bergmann, LKML, Russell King, clang-built-linux,
	Nathan Chancellor, Collabora Kernel ML, Ard Biesheuvel,
	Linux ARM

On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
wrote:
> On Fri, Nov 6, 2020 at 3:50 AM Adrian Ratiu 
> <adrian.ratiu@collabora.com> wrote: 
>> 
>> Hi Nathan, 
>> 
>> On Fri, 06 Nov 2020, Nathan Chancellor 
>> <natechancellor@gmail.com> wrote: 
>> > + Ard, who wrote this code. 
>> > 
>> > On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote: 
>> >> Due to a Clang bug [1] neon autoloop vectorization does not 
>> >> happen or happens badly with no gains and considering 
>> >> previous GCC experiences which generated unoptimized code 
>> >> which was worse than the default asm implementation, it is 
>> >> safer to default clang builds to the known good generic 
>> >> implementation.  The kernel currently supports a minimum 
>> >> Clang version of v10.0.1, see commit 1f7a44f63e6c 
>> >> ("compiler-clang: add build check for clang 10.0.1").   When 
>> >> the bug gets eventually fixed, this commit could be reverted 
>> >> or, if the minimum clang version bump takes a long time, a 
>> >> warning could be added for users to upgrade their compilers 
>> >> like was done for GCC.   [1] 
>> >> https://bugs.llvm.org/show_bug.cgi?id=40976  Signed-off-by: 
>> >> Adrian Ratiu <adrian.ratiu@collabora.com> 
>> > 
>> > Thank you for the patch! We are also tracking this here: 
>> > 
>> > https://github.com/ClangBuiltLinux/linux/issues/496 
>> > 
>> > It was on my TODO to revist getting the warning eliminated, 
>> > which likely would have involved a patch like this as well. 
>> > 
>> > I am curious if it is worth revisting or dusting off Arnd's 
>> > patch in the LLVM bug tracker first. I have not tried it 
>> > personally. If that is not a worthwhile option, I am fine 
>> > with this for now. It would be nice to try and get a fix 
>> > pinned down on the LLVM side at some point but alas, finite 
>> > amount of resources and people :( 
>> 
>> I tested Arnd's kernel patch from the LLVM bugtracker [1], but 
>> with the Clang v10.0.1 I still get warnings like the following 
>> even though the __restrict workaround seems to affect the 
>> generated instructions: 
>> 
>> ./include/asm-generic/xor.h:15:2: remark: the cost-model 
>> indicates that interleaving is not beneficial 
>> [-Rpass-missed=loop-vectorize] 
>> ./include/asm-generic/xor.h:11:1: remark: List vectorization 
>> was possible but not beneficial with cost 0 >= 0 
>> [-Rpass-missed=slp-vectorizer] xor_8regs_2(unsigned long bytes, 
>> unsigned long *__restrict p1, unsigned long *__restrict p2) 
> 
> If it's just a matter of overruling the cost model #pragma clang 
> loop vectorize(enable) 
> 
> will do the trick. 
> 
> Indeed, ``` diff --git a/include/asm-generic/xor.h 
> b/include/asm-generic/xor.h index b62a2a56a4d4..8796955498b7 
> 100644 --- a/include/asm-generic/xor.h +++ 
> b/include/asm-generic/xor.h @@ -12,6 +12,7 @@ 
> xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned 
> long *p2) 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0]; p1[1] ^= p2[1]; 
> @@ -32,6 +33,7 @@ xor_8regs_3(unsigned long bytes, unsigned long 
> *p1, unsigned long *p2, 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0] ^ p3[0]; p1[1] ^= p2[1] ^ p3[1]; 
> @@ -53,6 +55,7 @@ xor_8regs_4(unsigned long bytes, unsigned long 
> *p1, unsigned long *p2, 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0]; p1[1] ^= p2[1] ^ 
>                 p3[1] ^ p4[1]; 
> @@ -75,6 +78,7 @@ xor_8regs_5(unsigned long bytes, unsigned long 
> *p1, unsigned long *p2, 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^= 
>                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1]; 
> ``` seems to generate the vectorized code. 
> 
> Why don't we find a way to make those pragma's more toolchain 
> portable, rather than open coding them like I have above rather 
> than this series?

Hi Nick,

Thank you very much for the suggestion.

I agree. If a toolchain portable way can be found to realiably 
trigger the optimization, I will gladly replace this patch. :)

Will work on it starting Monday then report back my findings or, 
if I can get it to work in a satisfying manner, send a v2 series 
directly.

The first patch is still needed because it's more of a general 
cleanup as Nathan correctly observed.

Regards,
Adrian

>
> -- 
> Thanks,
> ~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-07 10:22     ` Russell King - ARM Linux admin
@ 2020-11-07 18:12       ` Adrian Ratiu
  -1 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-07 18:12 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: linux-arm-kernel, Nathan Chancellor, Nick Desaulniers,
	Arnd Bergmann, clang-built-linux, linux-kernel, kernel

On Sat, 07 Nov 2020, Russell King - ARM Linux admin 
<linux@armlinux.org.uk> wrote:
> On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote: 
>> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c 
>> index e1e76186ec23..84c91c48dfa2 100644 --- 
>> a/arch/arm/lib/xor-neon.c +++ b/arch/arm/lib/xor-neon.c @@ 
>> -18,6 +18,10 @@ MODULE_LICENSE("GPL"); 
>>   * Pull in the reference implementations while instructing GCC 
>>   (through * -ftree-vectorize) to attempt to exploit implicit 
>>   parallelism and emit * NEON instructions. 
>> + 
> 
> Please tidy this up before submission; we normally continue the 
> "*" for blank lines in comment blocks. Thanks. 

Indeed, thank you. I will fix it if I don't replace this patch 
entirely with something similar to what Nick suggested.

Perhaps adding a checkpatch test for this is a good idea?

Adrian

>
>> + * On Clang the loop vectorizer is enabled by default, but due to a bug
>> + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
>> + * so xor-neon is disabled in favor of the default reg implementations.
>>   */
>>  #ifdef CONFIG_CC_IS_GCC
>>  #pragma GCC optimize "tree-vectorize"
>> -- 
>> 2.29.0
>> 
>> 
>
> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-07 18:12       ` Adrian Ratiu
  0 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-07 18:12 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Arnd Bergmann, Nick Desaulniers, linux-kernel, clang-built-linux,
	Nathan Chancellor, kernel, linux-arm-kernel

On Sat, 07 Nov 2020, Russell King - ARM Linux admin 
<linux@armlinux.org.uk> wrote:
> On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote: 
>> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c 
>> index e1e76186ec23..84c91c48dfa2 100644 --- 
>> a/arch/arm/lib/xor-neon.c +++ b/arch/arm/lib/xor-neon.c @@ 
>> -18,6 +18,10 @@ MODULE_LICENSE("GPL"); 
>>   * Pull in the reference implementations while instructing GCC 
>>   (through * -ftree-vectorize) to attempt to exploit implicit 
>>   parallelism and emit * NEON instructions. 
>> + 
> 
> Please tidy this up before submission; we normally continue the 
> "*" for blank lines in comment blocks. Thanks. 

Indeed, thank you. I will fix it if I don't replace this patch 
entirely with something similar to what Nick suggested.

Perhaps adding a checkpatch test for this is a good idea?

Adrian

>
>> + * On Clang the loop vectorizer is enabled by default, but due to a bug
>> + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
>> + * so xor-neon is disabled in favor of the default reg implementations.
>>   */
>>  #ifdef CONFIG_CC_IS_GCC
>>  #pragma GCC optimize "tree-vectorize"
>> -- 
>> 2.29.0
>> 
>> 
>
> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-06  5:14   ` Adrian Ratiu
@ 2020-11-08 17:40     ` Arvind Sankar
  -1 siblings, 0 replies; 56+ messages in thread
From: Arvind Sankar @ 2020-11-08 17:40 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: linux-arm-kernel, Nathan Chancellor, Nick Desaulniers,
	Arnd Bergmann, clang-built-linux, Russell King, linux-kernel,
	kernel

On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> Due to a Clang bug [1] neon autoloop vectorization does not happen or
> happens badly with no gains and considering previous GCC experiences
> which generated unoptimized code which was worse than the default asm
> implementation, it is safer to default clang builds to the known good
> generic implementation.
> 
> The kernel currently supports a minimum Clang version of v10.0.1, see
> commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").
> 
> When the bug gets eventually fixed, this commit could be reverted or,
> if the minimum clang version bump takes a long time, a warning could
> be added for users to upgrade their compilers like was done for GCC.
> 
> [1] https://bugs.llvm.org/show_bug.cgi?id=40976
> 
> Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> ---
>  arch/arm/include/asm/xor.h | 3 ++-
>  arch/arm/lib/Makefile      | 3 +++
>  arch/arm/lib/xor-neon.c    | 4 ++++
>  3 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> index aefddec79286..49937dafaa71 100644
> --- a/arch/arm/include/asm/xor.h
> +++ b/arch/arm/include/asm/xor.h
> @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
>  		NEON_TEMPLATES;			\
>  	} while (0)
>  
> -#ifdef CONFIG_KERNEL_MODE_NEON
> +/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
> +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
>  
>  extern struct xor_block_template const xor_block_neon_inner;
>  
> diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> index 6d2ba454f25b..53f9e7dd9714 100644
> --- a/arch/arm/lib/Makefile
> +++ b/arch/arm/lib/Makefile
> @@ -43,8 +43,11 @@ endif
>  $(obj)/csumpartialcopy.o:	$(obj)/csumpartialcopygeneric.S
>  $(obj)/csumpartialcopyuser.o:	$(obj)/csumpartialcopygeneric.S
>  
> +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
> +ifndef CONFIG_CC_IS_CLANG
>  ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
>    NEON_FLAGS			:= -march=armv7-a -mfloat-abi=softfp -mfpu=neon
>    CFLAGS_xor-neon.o		+= $(NEON_FLAGS)
>    obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
>  endif
> +endif
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index e1e76186ec23..84c91c48dfa2 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
>   * Pull in the reference implementations while instructing GCC (through
>   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
>   * NEON instructions.
> +
> + * On Clang the loop vectorizer is enabled by default, but due to a bug
> + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> + * so xor-neon is disabled in favor of the default reg implementations.
>   */
>  #ifdef CONFIG_CC_IS_GCC
>  #pragma GCC optimize "tree-vectorize"
> -- 
> 2.29.0
> 

It's actually a bad idea to use #pragma GCC optimize. This is basically
the same as tagging all the functions with __attribute__((optimize)),
which GCC does not recommend for production use, as it _replaces_
optimization options rather than appending to them, and has been
observed to result in dropping important compiler flags.

There've been a few discussions recently around other such cases:
https://lore.kernel.org/lkml/20201028171506.15682-1-ardb@kernel.org/
https://lore.kernel.org/lkml/20201028081123.GT2628@hirez.programming.kicks-ass.net/

For this file, given that it is supposed to use -ftree-vectorize for the
whole file anyway, is there any reason it's not just added to CFLAGS via
the Makefile? This seems to be the only use of pragma optimize in the
kernel.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-08 17:40     ` Arvind Sankar
  0 siblings, 0 replies; 56+ messages in thread
From: Arvind Sankar @ 2020-11-08 17:40 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Arnd Bergmann, Nick Desaulniers, Russell King, linux-kernel,
	clang-built-linux, Nathan Chancellor, kernel, linux-arm-kernel

On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> Due to a Clang bug [1] neon autoloop vectorization does not happen or
> happens badly with no gains and considering previous GCC experiences
> which generated unoptimized code which was worse than the default asm
> implementation, it is safer to default clang builds to the known good
> generic implementation.
> 
> The kernel currently supports a minimum Clang version of v10.0.1, see
> commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").
> 
> When the bug gets eventually fixed, this commit could be reverted or,
> if the minimum clang version bump takes a long time, a warning could
> be added for users to upgrade their compilers like was done for GCC.
> 
> [1] https://bugs.llvm.org/show_bug.cgi?id=40976
> 
> Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> ---
>  arch/arm/include/asm/xor.h | 3 ++-
>  arch/arm/lib/Makefile      | 3 +++
>  arch/arm/lib/xor-neon.c    | 4 ++++
>  3 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> index aefddec79286..49937dafaa71 100644
> --- a/arch/arm/include/asm/xor.h
> +++ b/arch/arm/include/asm/xor.h
> @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
>  		NEON_TEMPLATES;			\
>  	} while (0)
>  
> -#ifdef CONFIG_KERNEL_MODE_NEON
> +/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
> +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
>  
>  extern struct xor_block_template const xor_block_neon_inner;
>  
> diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> index 6d2ba454f25b..53f9e7dd9714 100644
> --- a/arch/arm/lib/Makefile
> +++ b/arch/arm/lib/Makefile
> @@ -43,8 +43,11 @@ endif
>  $(obj)/csumpartialcopy.o:	$(obj)/csumpartialcopygeneric.S
>  $(obj)/csumpartialcopyuser.o:	$(obj)/csumpartialcopygeneric.S
>  
> +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
> +ifndef CONFIG_CC_IS_CLANG
>  ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
>    NEON_FLAGS			:= -march=armv7-a -mfloat-abi=softfp -mfpu=neon
>    CFLAGS_xor-neon.o		+= $(NEON_FLAGS)
>    obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
>  endif
> +endif
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index e1e76186ec23..84c91c48dfa2 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
>   * Pull in the reference implementations while instructing GCC (through
>   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
>   * NEON instructions.
> +
> + * On Clang the loop vectorizer is enabled by default, but due to a bug
> + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> + * so xor-neon is disabled in favor of the default reg implementations.
>   */
>  #ifdef CONFIG_CC_IS_GCC
>  #pragma GCC optimize "tree-vectorize"
> -- 
> 2.29.0
> 

It's actually a bad idea to use #pragma GCC optimize. This is basically
the same as tagging all the functions with __attribute__((optimize)),
which GCC does not recommend for production use, as it _replaces_
optimization options rather than appending to them, and has been
observed to result in dropping important compiler flags.

There've been a few discussions recently around other such cases:
https://lore.kernel.org/lkml/20201028171506.15682-1-ardb@kernel.org/
https://lore.kernel.org/lkml/20201028081123.GT2628@hirez.programming.kicks-ass.net/

For this file, given that it is supposed to use -ftree-vectorize for the
whole file anyway, is there any reason it's not just added to CFLAGS via
the Makefile? This seems to be the only use of pragma optimize in the
kernel.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-08 17:40     ` Arvind Sankar
@ 2020-11-08 18:09       ` Arvind Sankar
  -1 siblings, 0 replies; 56+ messages in thread
From: Arvind Sankar @ 2020-11-08 18:09 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Adrian Ratiu, linux-arm-kernel, Nathan Chancellor,
	Nick Desaulniers, Arnd Bergmann, clang-built-linux, Russell King,
	linux-kernel, kernel

On Sun, Nov 08, 2020 at 12:40:14PM -0500, Arvind Sankar wrote:
> On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> > Due to a Clang bug [1] neon autoloop vectorization does not happen or
> > happens badly with no gains and considering previous GCC experiences
> > which generated unoptimized code which was worse than the default asm
> > implementation, it is safer to default clang builds to the known good
> > generic implementation.
> > 
> > The kernel currently supports a minimum Clang version of v10.0.1, see
> > commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").
> > 
> > When the bug gets eventually fixed, this commit could be reverted or,
> > if the minimum clang version bump takes a long time, a warning could
> > be added for users to upgrade their compilers like was done for GCC.
> > 
> > [1] https://bugs.llvm.org/show_bug.cgi?id=40976
> > 
> > Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> > ---
> >  arch/arm/include/asm/xor.h | 3 ++-
> >  arch/arm/lib/Makefile      | 3 +++
> >  arch/arm/lib/xor-neon.c    | 4 ++++
> >  3 files changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> > index aefddec79286..49937dafaa71 100644
> > --- a/arch/arm/include/asm/xor.h
> > +++ b/arch/arm/include/asm/xor.h
> > @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
> >  		NEON_TEMPLATES;			\
> >  	} while (0)
> >  
> > -#ifdef CONFIG_KERNEL_MODE_NEON
> > +/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
> > +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
> >  
> >  extern struct xor_block_template const xor_block_neon_inner;
> >  
> > diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> > index 6d2ba454f25b..53f9e7dd9714 100644
> > --- a/arch/arm/lib/Makefile
> > +++ b/arch/arm/lib/Makefile
> > @@ -43,8 +43,11 @@ endif
> >  $(obj)/csumpartialcopy.o:	$(obj)/csumpartialcopygeneric.S
> >  $(obj)/csumpartialcopyuser.o:	$(obj)/csumpartialcopygeneric.S
> >  
> > +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
> > +ifndef CONFIG_CC_IS_CLANG
> >  ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
> >    NEON_FLAGS			:= -march=armv7-a -mfloat-abi=softfp -mfpu=neon
> >    CFLAGS_xor-neon.o		+= $(NEON_FLAGS)
> >    obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
> >  endif
> > +endif
> > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > index e1e76186ec23..84c91c48dfa2 100644
> > --- a/arch/arm/lib/xor-neon.c
> > +++ b/arch/arm/lib/xor-neon.c
> > @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
> >   * Pull in the reference implementations while instructing GCC (through
> >   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> >   * NEON instructions.
> > +
> > + * On Clang the loop vectorizer is enabled by default, but due to a bug
> > + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> > + * so xor-neon is disabled in favor of the default reg implementations.
> >   */
> >  #ifdef CONFIG_CC_IS_GCC
> >  #pragma GCC optimize "tree-vectorize"
> > -- 
> > 2.29.0
> > 
> 
> It's actually a bad idea to use #pragma GCC optimize. This is basically
> the same as tagging all the functions with __attribute__((optimize)),
> which GCC does not recommend for production use, as it _replaces_
> optimization options rather than appending to them, and has been
> observed to result in dropping important compiler flags.
> 
> There've been a few discussions recently around other such cases:
> https://lore.kernel.org/lkml/20201028171506.15682-1-ardb@kernel.org/
> https://lore.kernel.org/lkml/20201028081123.GT2628@hirez.programming.kicks-ass.net/
> 
> For this file, given that it is supposed to use -ftree-vectorize for the
> whole file anyway, is there any reason it's not just added to CFLAGS via
> the Makefile? This seems to be the only use of pragma optimize in the
> kernel.

Eg, this shows that the pragma results in dropping -fno-strict-aliasing.
https://godbolt.org/z/1nfrKT

The first function does not use vectorization because s and s->a might
alias.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-08 18:09       ` Arvind Sankar
  0 siblings, 0 replies; 56+ messages in thread
From: Arvind Sankar @ 2020-11-08 18:09 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Arnd Bergmann, Adrian Ratiu, Nick Desaulniers, Russell King,
	linux-kernel, clang-built-linux, Nathan Chancellor, kernel,
	linux-arm-kernel

On Sun, Nov 08, 2020 at 12:40:14PM -0500, Arvind Sankar wrote:
> On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> > Due to a Clang bug [1] neon autoloop vectorization does not happen or
> > happens badly with no gains and considering previous GCC experiences
> > which generated unoptimized code which was worse than the default asm
> > implementation, it is safer to default clang builds to the known good
> > generic implementation.
> > 
> > The kernel currently supports a minimum Clang version of v10.0.1, see
> > commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").
> > 
> > When the bug gets eventually fixed, this commit could be reverted or,
> > if the minimum clang version bump takes a long time, a warning could
> > be added for users to upgrade their compilers like was done for GCC.
> > 
> > [1] https://bugs.llvm.org/show_bug.cgi?id=40976
> > 
> > Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> > ---
> >  arch/arm/include/asm/xor.h | 3 ++-
> >  arch/arm/lib/Makefile      | 3 +++
> >  arch/arm/lib/xor-neon.c    | 4 ++++
> >  3 files changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> > index aefddec79286..49937dafaa71 100644
> > --- a/arch/arm/include/asm/xor.h
> > +++ b/arch/arm/include/asm/xor.h
> > @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
> >  		NEON_TEMPLATES;			\
> >  	} while (0)
> >  
> > -#ifdef CONFIG_KERNEL_MODE_NEON
> > +/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
> > +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
> >  
> >  extern struct xor_block_template const xor_block_neon_inner;
> >  
> > diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> > index 6d2ba454f25b..53f9e7dd9714 100644
> > --- a/arch/arm/lib/Makefile
> > +++ b/arch/arm/lib/Makefile
> > @@ -43,8 +43,11 @@ endif
> >  $(obj)/csumpartialcopy.o:	$(obj)/csumpartialcopygeneric.S
> >  $(obj)/csumpartialcopyuser.o:	$(obj)/csumpartialcopygeneric.S
> >  
> > +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
> > +ifndef CONFIG_CC_IS_CLANG
> >  ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
> >    NEON_FLAGS			:= -march=armv7-a -mfloat-abi=softfp -mfpu=neon
> >    CFLAGS_xor-neon.o		+= $(NEON_FLAGS)
> >    obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
> >  endif
> > +endif
> > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > index e1e76186ec23..84c91c48dfa2 100644
> > --- a/arch/arm/lib/xor-neon.c
> > +++ b/arch/arm/lib/xor-neon.c
> > @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
> >   * Pull in the reference implementations while instructing GCC (through
> >   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> >   * NEON instructions.
> > +
> > + * On Clang the loop vectorizer is enabled by default, but due to a bug
> > + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> > + * so xor-neon is disabled in favor of the default reg implementations.
> >   */
> >  #ifdef CONFIG_CC_IS_GCC
> >  #pragma GCC optimize "tree-vectorize"
> > -- 
> > 2.29.0
> > 
> 
> It's actually a bad idea to use #pragma GCC optimize. This is basically
> the same as tagging all the functions with __attribute__((optimize)),
> which GCC does not recommend for production use, as it _replaces_
> optimization options rather than appending to them, and has been
> observed to result in dropping important compiler flags.
> 
> There've been a few discussions recently around other such cases:
> https://lore.kernel.org/lkml/20201028171506.15682-1-ardb@kernel.org/
> https://lore.kernel.org/lkml/20201028081123.GT2628@hirez.programming.kicks-ass.net/
> 
> For this file, given that it is supposed to use -ftree-vectorize for the
> whole file anyway, is there any reason it's not just added to CFLAGS via
> the Makefile? This seems to be the only use of pragma optimize in the
> kernel.

Eg, this shows that the pragma results in dropping -fno-strict-aliasing.
https://godbolt.org/z/1nfrKT

The first function does not use vectorization because s and s->a might
alias.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-08 18:09       ` Arvind Sankar
@ 2020-11-08 20:14         ` Ard Biesheuvel
  -1 siblings, 0 replies; 56+ messages in thread
From: Ard Biesheuvel @ 2020-11-08 20:14 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Arnd Bergmann, Adrian Ratiu, Nick Desaulniers, Russell King,
	Linux Kernel Mailing List, clang-built-linux, Nathan Chancellor,
	kernel, Linux ARM

On Sun, 8 Nov 2020 at 19:10, Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> On Sun, Nov 08, 2020 at 12:40:14PM -0500, Arvind Sankar wrote:
> > On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> > > Due to a Clang bug [1] neon autoloop vectorization does not happen or
> > > happens badly with no gains and considering previous GCC experiences
> > > which generated unoptimized code which was worse than the default asm
> > > implementation, it is safer to default clang builds to the known good
> > > generic implementation.
> > >
> > > The kernel currently supports a minimum Clang version of v10.0.1, see
> > > commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").
> > >
> > > When the bug gets eventually fixed, this commit could be reverted or,
> > > if the minimum clang version bump takes a long time, a warning could
> > > be added for users to upgrade their compilers like was done for GCC.
> > >
> > > [1] https://bugs.llvm.org/show_bug.cgi?id=40976
> > >
> > > Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> > > ---
> > >  arch/arm/include/asm/xor.h | 3 ++-
> > >  arch/arm/lib/Makefile      | 3 +++
> > >  arch/arm/lib/xor-neon.c    | 4 ++++
> > >  3 files changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> > > index aefddec79286..49937dafaa71 100644
> > > --- a/arch/arm/include/asm/xor.h
> > > +++ b/arch/arm/include/asm/xor.h
> > > @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
> > >             NEON_TEMPLATES;                 \
> > >     } while (0)
> > >
> > > -#ifdef CONFIG_KERNEL_MODE_NEON
> > > +/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
> > > +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
> > >
> > >  extern struct xor_block_template const xor_block_neon_inner;
> > >
> > > diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> > > index 6d2ba454f25b..53f9e7dd9714 100644
> > > --- a/arch/arm/lib/Makefile
> > > +++ b/arch/arm/lib/Makefile
> > > @@ -43,8 +43,11 @@ endif
> > >  $(obj)/csumpartialcopy.o:  $(obj)/csumpartialcopygeneric.S
> > >  $(obj)/csumpartialcopyuser.o:      $(obj)/csumpartialcopygeneric.S
> > >
> > > +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
> > > +ifndef CONFIG_CC_IS_CLANG
> > >  ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
> > >    NEON_FLAGS                       := -march=armv7-a -mfloat-abi=softfp -mfpu=neon
> > >    CFLAGS_xor-neon.o                += $(NEON_FLAGS)
> > >    obj-$(CONFIG_XOR_BLOCKS) += xor-neon.o
> > >  endif
> > > +endif
> > > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > > index e1e76186ec23..84c91c48dfa2 100644
> > > --- a/arch/arm/lib/xor-neon.c
> > > +++ b/arch/arm/lib/xor-neon.c
> > > @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
> > >   * Pull in the reference implementations while instructing GCC (through
> > >   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> > >   * NEON instructions.
> > > +
> > > + * On Clang the loop vectorizer is enabled by default, but due to a bug
> > > + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> > > + * so xor-neon is disabled in favor of the default reg implementations.
> > >   */
> > >  #ifdef CONFIG_CC_IS_GCC
> > >  #pragma GCC optimize "tree-vectorize"
> > > --
> > > 2.29.0
> > >
> >
> > It's actually a bad idea to use #pragma GCC optimize. This is basically
> > the same as tagging all the functions with __attribute__((optimize)),
> > which GCC does not recommend for production use, as it _replaces_
> > optimization options rather than appending to them, and has been
> > observed to result in dropping important compiler flags.
> >
> > There've been a few discussions recently around other such cases:
> > https://lore.kernel.org/lkml/20201028171506.15682-1-ardb@kernel.org/
> > https://lore.kernel.org/lkml/20201028081123.GT2628@hirez.programming.kicks-ass.net/
> >
> > For this file, given that it is supposed to use -ftree-vectorize for the
> > whole file anyway, is there any reason it's not just added to CFLAGS via
> > the Makefile? This seems to be the only use of pragma optimize in the
> > kernel.
>
> Eg, this shows that the pragma results in dropping -fno-strict-aliasing.
> https://godbolt.org/z/1nfrKT
>
> The first function does not use vectorization because s and s->a might
> alias.
>

Thanks, Arvind. I wasn't aware of this issue at the time, but I agree
that we should replace the #pragma with a command line option in this
case.

And given that we already set CFLAGS_xor-neon.o in the Makefile,
adding it there would have been more straight-forward to begin with.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-08 20:14         ` Ard Biesheuvel
  0 siblings, 0 replies; 56+ messages in thread
From: Ard Biesheuvel @ 2020-11-08 20:14 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Arnd Bergmann, Adrian Ratiu, Nick Desaulniers, Russell King,
	Linux Kernel Mailing List, clang-built-linux, Nathan Chancellor,
	kernel, Linux ARM

On Sun, 8 Nov 2020 at 19:10, Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> On Sun, Nov 08, 2020 at 12:40:14PM -0500, Arvind Sankar wrote:
> > On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> > > Due to a Clang bug [1] neon autoloop vectorization does not happen or
> > > happens badly with no gains and considering previous GCC experiences
> > > which generated unoptimized code which was worse than the default asm
> > > implementation, it is safer to default clang builds to the known good
> > > generic implementation.
> > >
> > > The kernel currently supports a minimum Clang version of v10.0.1, see
> > > commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").
> > >
> > > When the bug gets eventually fixed, this commit could be reverted or,
> > > if the minimum clang version bump takes a long time, a warning could
> > > be added for users to upgrade their compilers like was done for GCC.
> > >
> > > [1] https://bugs.llvm.org/show_bug.cgi?id=40976
> > >
> > > Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
> > > ---
> > >  arch/arm/include/asm/xor.h | 3 ++-
> > >  arch/arm/lib/Makefile      | 3 +++
> > >  arch/arm/lib/xor-neon.c    | 4 ++++
> > >  3 files changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> > > index aefddec79286..49937dafaa71 100644
> > > --- a/arch/arm/include/asm/xor.h
> > > +++ b/arch/arm/include/asm/xor.h
> > > @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
> > >             NEON_TEMPLATES;                 \
> > >     } while (0)
> > >
> > > -#ifdef CONFIG_KERNEL_MODE_NEON
> > > +/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
> > > +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
> > >
> > >  extern struct xor_block_template const xor_block_neon_inner;
> > >
> > > diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> > > index 6d2ba454f25b..53f9e7dd9714 100644
> > > --- a/arch/arm/lib/Makefile
> > > +++ b/arch/arm/lib/Makefile
> > > @@ -43,8 +43,11 @@ endif
> > >  $(obj)/csumpartialcopy.o:  $(obj)/csumpartialcopygeneric.S
> > >  $(obj)/csumpartialcopyuser.o:      $(obj)/csumpartialcopygeneric.S
> > >
> > > +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
> > > +ifndef CONFIG_CC_IS_CLANG
> > >  ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
> > >    NEON_FLAGS                       := -march=armv7-a -mfloat-abi=softfp -mfpu=neon
> > >    CFLAGS_xor-neon.o                += $(NEON_FLAGS)
> > >    obj-$(CONFIG_XOR_BLOCKS) += xor-neon.o
> > >  endif
> > > +endif
> > > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > > index e1e76186ec23..84c91c48dfa2 100644
> > > --- a/arch/arm/lib/xor-neon.c
> > > +++ b/arch/arm/lib/xor-neon.c
> > > @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
> > >   * Pull in the reference implementations while instructing GCC (through
> > >   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> > >   * NEON instructions.
> > > +
> > > + * On Clang the loop vectorizer is enabled by default, but due to a bug
> > > + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> > > + * so xor-neon is disabled in favor of the default reg implementations.
> > >   */
> > >  #ifdef CONFIG_CC_IS_GCC
> > >  #pragma GCC optimize "tree-vectorize"
> > > --
> > > 2.29.0
> > >
> >
> > It's actually a bad idea to use #pragma GCC optimize. This is basically
> > the same as tagging all the functions with __attribute__((optimize)),
> > which GCC does not recommend for production use, as it _replaces_
> > optimization options rather than appending to them, and has been
> > observed to result in dropping important compiler flags.
> >
> > There've been a few discussions recently around other such cases:
> > https://lore.kernel.org/lkml/20201028171506.15682-1-ardb@kernel.org/
> > https://lore.kernel.org/lkml/20201028081123.GT2628@hirez.programming.kicks-ass.net/
> >
> > For this file, given that it is supposed to use -ftree-vectorize for the
> > whole file anyway, is there any reason it's not just added to CFLAGS via
> > the Makefile? This seems to be the only use of pragma optimize in the
> > kernel.
>
> Eg, this shows that the pragma results in dropping -fno-strict-aliasing.
> https://godbolt.org/z/1nfrKT
>
> The first function does not use vectorization because s and s->a might
> alias.
>

Thanks, Arvind. I wasn't aware of this issue at the time, but I agree
that we should replace the #pragma with a command line option in this
case.

And given that we already set CFLAGS_xor-neon.o in the Makefile,
adding it there would have been more straight-forward to begin with.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-06 19:52         ` Nick Desaulniers
@ 2020-11-09 19:53           ` Adrian Ratiu
  -1 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-09 19:53 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Nathan Chancellor, Arnd Bergmann, Linux ARM, clang-built-linux,
	Russell King, LKML, Collabora Kernel ML, Ard Biesheuvel

On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
wrote:
> On Fri, Nov 6, 2020 at 3:50 AM Adrian Ratiu 
> <adrian.ratiu@collabora.com> wrote: 
>> 
>> Hi Nathan, 
>> 
>> On Fri, 06 Nov 2020, Nathan Chancellor 
>> <natechancellor@gmail.com> wrote: 
>> > + Ard, who wrote this code. 
>> > 
>> > On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote: 
>> >> Due to a Clang bug [1] neon autoloop vectorization does not 
>> >> happen or happens badly with no gains and considering 
>> >> previous GCC experiences which generated unoptimized code 
>> >> which was worse than the default asm implementation, it is 
>> >> safer to default clang builds to the known good generic 
>> >> implementation.  The kernel currently supports a minimum 
>> >> Clang version of v10.0.1, see commit 1f7a44f63e6c 
>> >> ("compiler-clang: add build check for clang 10.0.1").   When 
>> >> the bug gets eventually fixed, this commit could be reverted 
>> >> or, if the minimum clang version bump takes a long time, a 
>> >> warning could be added for users to upgrade their compilers 
>> >> like was done for GCC.   [1] 
>> >> https://bugs.llvm.org/show_bug.cgi?id=40976  Signed-off-by: 
>> >> Adrian Ratiu <adrian.ratiu@collabora.com> 
>> > 
>> > Thank you for the patch! We are also tracking this here: 
>> > 
>> > https://github.com/ClangBuiltLinux/linux/issues/496 
>> > 
>> > It was on my TODO to revist getting the warning eliminated, 
>> > which likely would have involved a patch like this as well. 
>> > 
>> > I am curious if it is worth revisting or dusting off Arnd's 
>> > patch in the LLVM bug tracker first. I have not tried it 
>> > personally. If that is not a worthwhile option, I am fine 
>> > with this for now. It would be nice to try and get a fix 
>> > pinned down on the LLVM side at some point but alas, finite 
>> > amount of resources and people :( 
>> 
>> I tested Arnd's kernel patch from the LLVM bugtracker [1], but 
>> with the Clang v10.0.1 I still get warnings like the following 
>> even though the __restrict workaround seems to affect the 
>> generated instructions: 
>> 
>> ./include/asm-generic/xor.h:15:2: remark: the cost-model 
>> indicates that interleaving is not beneficial 
>> [-Rpass-missed=loop-vectorize] 
>> ./include/asm-generic/xor.h:11:1: remark: List vectorization 
>> was possible but not beneficial with cost 0 >= 0 
>> [-Rpass-missed=slp-vectorizer] xor_8regs_2(unsigned long bytes, 
>> unsigned long *__restrict p1, unsigned long *__restrict p2) 
> 
> If it's just a matter of overruling the cost model #pragma clang 
> loop vectorize(enable) 
> 
> will do the trick. 
> 
> Indeed, ``` diff --git a/include/asm-generic/xor.h 
> b/include/asm-generic/xor.h index b62a2a56a4d4..8796955498b7 
> 100644 --- a/include/asm-generic/xor.h +++ 
> b/include/asm-generic/xor.h @@ -12,6 +12,7 @@ 
> xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned 
> long *p2) 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0]; p1[1] ^= p2[1]; 
> @@ -32,6 +33,7 @@ xor_8regs_3(unsigned long bytes, unsigned long 
> *p1, unsigned long *p2, 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0] ^ p3[0]; p1[1] ^= p2[1] ^ p3[1]; 
> @@ -53,6 +55,7 @@ xor_8regs_4(unsigned long bytes, unsigned long 
> *p1, unsigned long *p2, 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0]; p1[1] ^= p2[1] ^ 
>                 p3[1] ^ p4[1]; 
> @@ -75,6 +78,7 @@ xor_8regs_5(unsigned long bytes, unsigned long 
> *p1, unsigned long *p2, 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^= 
>                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1]; 
> ``` seems to generate the vectorized code. 
> 
> Why don't we find a way to make those pragma's more toolchain 
> portable, rather than open coding them like I have above rather 
> than this series? 

Hi again Nick,

How did you verify the above pragmas generate correct vectorized 
code?  Have you tested this specific use case?

I'm asking because overrulling the cost model might not be enough, 
the only thing I can confirm is that the generated code is 
changed, but not that it is correct in any way. The object disasm 
also looks weird, but I don't have enough knowledge to start 
debugging what's happening within LLVM/Clang itself.

I also get some new warnings with your code [1], besides the 
previously 'vectorization was possible but not beneficial' which 
is still present. It is quite funny because these two warnings 
seem to contradict themselves. :)

At this point I do not trust the compiler and am inclined to do 
like was done for GCC when it was broken: disable the optimization 
and warn users to upgrade after the compiler is fixed and 
confirmed to work.

If you agree I can send a v2 with this and also drop the GCC 
pragma as Arvind and Ard suggested.

Kind regards,
Adrian

[1]
./include/asm-generic/xor.h:11:1: warning: loop not vectorized: 
the optimizer was unable to perform the requested transformation; 
the transformation might be disabled or specified as part of an 
unsupported transformation ordering 
[-Wpass-failed=transform-warning] xor_8regs_2(unsigned long bytes, 
unsigned long *p1, unsigned long *p2) 


>
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-09 19:53           ` Adrian Ratiu
  0 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-09 19:53 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Arnd Bergmann, LKML, Russell King, clang-built-linux,
	Nathan Chancellor, Collabora Kernel ML, Ard Biesheuvel,
	Linux ARM

On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
wrote:
> On Fri, Nov 6, 2020 at 3:50 AM Adrian Ratiu 
> <adrian.ratiu@collabora.com> wrote: 
>> 
>> Hi Nathan, 
>> 
>> On Fri, 06 Nov 2020, Nathan Chancellor 
>> <natechancellor@gmail.com> wrote: 
>> > + Ard, who wrote this code. 
>> > 
>> > On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote: 
>> >> Due to a Clang bug [1] neon autoloop vectorization does not 
>> >> happen or happens badly with no gains and considering 
>> >> previous GCC experiences which generated unoptimized code 
>> >> which was worse than the default asm implementation, it is 
>> >> safer to default clang builds to the known good generic 
>> >> implementation.  The kernel currently supports a minimum 
>> >> Clang version of v10.0.1, see commit 1f7a44f63e6c 
>> >> ("compiler-clang: add build check for clang 10.0.1").   When 
>> >> the bug gets eventually fixed, this commit could be reverted 
>> >> or, if the minimum clang version bump takes a long time, a 
>> >> warning could be added for users to upgrade their compilers 
>> >> like was done for GCC.   [1] 
>> >> https://bugs.llvm.org/show_bug.cgi?id=40976  Signed-off-by: 
>> >> Adrian Ratiu <adrian.ratiu@collabora.com> 
>> > 
>> > Thank you for the patch! We are also tracking this here: 
>> > 
>> > https://github.com/ClangBuiltLinux/linux/issues/496 
>> > 
>> > It was on my TODO to revist getting the warning eliminated, 
>> > which likely would have involved a patch like this as well. 
>> > 
>> > I am curious if it is worth revisting or dusting off Arnd's 
>> > patch in the LLVM bug tracker first. I have not tried it 
>> > personally. If that is not a worthwhile option, I am fine 
>> > with this for now. It would be nice to try and get a fix 
>> > pinned down on the LLVM side at some point but alas, finite 
>> > amount of resources and people :( 
>> 
>> I tested Arnd's kernel patch from the LLVM bugtracker [1], but 
>> with the Clang v10.0.1 I still get warnings like the following 
>> even though the __restrict workaround seems to affect the 
>> generated instructions: 
>> 
>> ./include/asm-generic/xor.h:15:2: remark: the cost-model 
>> indicates that interleaving is not beneficial 
>> [-Rpass-missed=loop-vectorize] 
>> ./include/asm-generic/xor.h:11:1: remark: List vectorization 
>> was possible but not beneficial with cost 0 >= 0 
>> [-Rpass-missed=slp-vectorizer] xor_8regs_2(unsigned long bytes, 
>> unsigned long *__restrict p1, unsigned long *__restrict p2) 
> 
> If it's just a matter of overruling the cost model #pragma clang 
> loop vectorize(enable) 
> 
> will do the trick. 
> 
> Indeed, ``` diff --git a/include/asm-generic/xor.h 
> b/include/asm-generic/xor.h index b62a2a56a4d4..8796955498b7 
> 100644 --- a/include/asm-generic/xor.h +++ 
> b/include/asm-generic/xor.h @@ -12,6 +12,7 @@ 
> xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned 
> long *p2) 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0]; p1[1] ^= p2[1]; 
> @@ -32,6 +33,7 @@ xor_8regs_3(unsigned long bytes, unsigned long 
> *p1, unsigned long *p2, 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0] ^ p3[0]; p1[1] ^= p2[1] ^ p3[1]; 
> @@ -53,6 +55,7 @@ xor_8regs_4(unsigned long bytes, unsigned long 
> *p1, unsigned long *p2, 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0]; p1[1] ^= p2[1] ^ 
>                 p3[1] ^ p4[1]; 
> @@ -75,6 +78,7 @@ xor_8regs_5(unsigned long bytes, unsigned long 
> *p1, unsigned long *p2, 
>  { 
>         long lines = bytes / (sizeof (long)) / 8; 
> 
> +#pragma clang loop vectorize(enable) 
>         do { 
>                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^= 
>                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1]; 
> ``` seems to generate the vectorized code. 
> 
> Why don't we find a way to make those pragma's more toolchain 
> portable, rather than open coding them like I have above rather 
> than this series? 

Hi again Nick,

How did you verify the above pragmas generate correct vectorized 
code?  Have you tested this specific use case?

I'm asking because overrulling the cost model might not be enough, 
the only thing I can confirm is that the generated code is 
changed, but not that it is correct in any way. The object disasm 
also looks weird, but I don't have enough knowledge to start 
debugging what's happening within LLVM/Clang itself.

I also get some new warnings with your code [1], besides the 
previously 'vectorization was possible but not beneficial' which 
is still present. It is quite funny because these two warnings 
seem to contradict themselves. :)

At this point I do not trust the compiler and am inclined to do 
like was done for GCC when it was broken: disable the optimization 
and warn users to upgrade after the compiler is fixed and 
confirmed to work.

If you agree I can send a v2 with this and also drop the GCC 
pragma as Arvind and Ard suggested.

Kind regards,
Adrian

[1]
./include/asm-generic/xor.h:11:1: warning: loop not vectorized: 
the optimizer was unable to perform the requested transformation; 
the transformation might be disabled or specified as part of an 
unsupported transformation ordering 
[-Wpass-failed=transform-warning] xor_8regs_2(unsigned long bytes, 
unsigned long *p1, unsigned long *p2) 


>
> -- 
> Thanks,
> ~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-09 19:53           ` Adrian Ratiu
@ 2020-11-10 21:41             ` Nick Desaulniers
  -1 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-10 21:41 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Nathan Chancellor, Arnd Bergmann, Linux ARM, clang-built-linux,
	Russell King, LKML, Collabora Kernel ML, Ard Biesheuvel

On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
>
> On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> wrote:
> > +#pragma clang loop vectorize(enable)
> >         do {
> >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> >                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > ``` seems to generate the vectorized code.
> >
> > Why don't we find a way to make those pragma's more toolchain
> > portable, rather than open coding them like I have above rather
> > than this series?
>
> Hi again Nick,
>
> How did you verify the above pragmas generate correct vectorized
> code?  Have you tested this specific use case?

I read the disassembly before and after my suggested use of pragmas;
look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.

>
> I'm asking because overrulling the cost model might not be enough,
> the only thing I can confirm is that the generated code is
> changed, but not that it is correct in any way. The object disasm
> also looks weird, but I don't have enough knowledge to start
> debugging what's happening within LLVM/Clang itself.

It doesn't "look weird" to me. The loop is versioned based on a
comparison whether the parameters alias or not. There's a
non-vectorized version if the parameters are equal or close enough to
overlap.  There's another version of the loop that's vectorized.  If
you want just the vectorized version, then you have to mark the
parameters as __restrict qualified, then check that all callers are ok
with that.

>
> I also get some new warnings with your code [1], besides the
> previously 'vectorization was possible but not beneficial' which
> is still present. It is quite funny because these two warnings
> seem to contradict themselves. :)

From which compiler?
```
$ clang -Wpass-failed=transform-warning -c -x c /dev/null
warning: unknown warning option '-Wpass-failed=transform-warning'; did
you mean '-Wprofile-instr-missing'? [-Wunknown-warning-option]
```

The pragma is clang specific, hence my recommendation to wrap it in an
#ifdef __clang__.

>
> At this point I do not trust the compiler and am inclined to do

Nonsense.

> like was done for GCC when it was broken: disable the optimization
> and warn users to upgrade after the compiler is fixed and
> confirmed to work.
>
> If you agree I can send a v2 with this and also drop the GCC
> pragma as Arvind and Ard suggested.

If you resend "this" as in 2/2, I will NACK it.  There's nothing wrong
with the cost model; it's saying there's little point in generating
the vectorized version because you're still going to need a
non-vectorized loop version anyways.  Claiming there is a compiler bug
here is dubious just because the cost models between two compilers
differ slightly.

Resend the patch removing the warning, remove the GCC pragma, but if
you want to change anything here for Clang, use `#pragma clang loop
vectorize(enable)` wrapped in an `#ifdef __clang__`.

>
> Kind regards,
> Adrian
>
> [1]
> ./include/asm-generic/xor.h:11:1: warning: loop not vectorized:
> the optimizer was unable to perform the requested transformation;
> the transformation might be disabled or specified as part of an
> unsupported transformation ordering
> [-Wpass-failed=transform-warning] xor_8regs_2(unsigned long bytes,
> unsigned long *p1, unsigned long *p2)

-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-10 21:41             ` Nick Desaulniers
  0 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-10 21:41 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Arnd Bergmann, LKML, Russell King, clang-built-linux,
	Nathan Chancellor, Collabora Kernel ML, Ard Biesheuvel,
	Linux ARM

On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
>
> On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> wrote:
> > +#pragma clang loop vectorize(enable)
> >         do {
> >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> >                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > ``` seems to generate the vectorized code.
> >
> > Why don't we find a way to make those pragma's more toolchain
> > portable, rather than open coding them like I have above rather
> > than this series?
>
> Hi again Nick,
>
> How did you verify the above pragmas generate correct vectorized
> code?  Have you tested this specific use case?

I read the disassembly before and after my suggested use of pragmas;
look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.

>
> I'm asking because overrulling the cost model might not be enough,
> the only thing I can confirm is that the generated code is
> changed, but not that it is correct in any way. The object disasm
> also looks weird, but I don't have enough knowledge to start
> debugging what's happening within LLVM/Clang itself.

It doesn't "look weird" to me. The loop is versioned based on a
comparison whether the parameters alias or not. There's a
non-vectorized version if the parameters are equal or close enough to
overlap.  There's another version of the loop that's vectorized.  If
you want just the vectorized version, then you have to mark the
parameters as __restrict qualified, then check that all callers are ok
with that.

>
> I also get some new warnings with your code [1], besides the
> previously 'vectorization was possible but not beneficial' which
> is still present. It is quite funny because these two warnings
> seem to contradict themselves. :)

From which compiler?
```
$ clang -Wpass-failed=transform-warning -c -x c /dev/null
warning: unknown warning option '-Wpass-failed=transform-warning'; did
you mean '-Wprofile-instr-missing'? [-Wunknown-warning-option]
```

The pragma is clang specific, hence my recommendation to wrap it in an
#ifdef __clang__.

>
> At this point I do not trust the compiler and am inclined to do

Nonsense.

> like was done for GCC when it was broken: disable the optimization
> and warn users to upgrade after the compiler is fixed and
> confirmed to work.
>
> If you agree I can send a v2 with this and also drop the GCC
> pragma as Arvind and Ard suggested.

If you resend "this" as in 2/2, I will NACK it.  There's nothing wrong
with the cost model; it's saying there's little point in generating
the vectorized version because you're still going to need a
non-vectorized loop version anyways.  Claiming there is a compiler bug
here is dubious just because the cost models between two compilers
differ slightly.

Resend the patch removing the warning, remove the GCC pragma, but if
you want to change anything here for Clang, use `#pragma clang loop
vectorize(enable)` wrapped in an `#ifdef __clang__`.

>
> Kind regards,
> Adrian
>
> [1]
> ./include/asm-generic/xor.h:11:1: warning: loop not vectorized:
> the optimizer was unable to perform the requested transformation;
> the transformation might be disabled or specified as part of an
> unsupported transformation ordering
> [-Wpass-failed=transform-warning] xor_8regs_2(unsigned long bytes,
> unsigned long *p1, unsigned long *p2)

-- 
Thanks,
~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-10 21:41             ` Nick Desaulniers
@ 2020-11-10 22:15               ` Arvind Sankar
  -1 siblings, 0 replies; 56+ messages in thread
From: Arvind Sankar @ 2020-11-10 22:15 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Adrian Ratiu, Nathan Chancellor, Arnd Bergmann, Linux ARM,
	clang-built-linux, Russell King, LKML, Collabora Kernel ML,
	Ard Biesheuvel

On Tue, Nov 10, 2020 at 01:41:17PM -0800, Nick Desaulniers wrote:
> On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
> >
> > On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> > wrote:
> > > +#pragma clang loop vectorize(enable)
> > >         do {
> > >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> > >                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > > ``` seems to generate the vectorized code.
> > >
> > > Why don't we find a way to make those pragma's more toolchain
> > > portable, rather than open coding them like I have above rather
> > > than this series?
> >
> > Hi again Nick,
> >
> > How did you verify the above pragmas generate correct vectorized
> > code?  Have you tested this specific use case?
> 
> I read the disassembly before and after my suggested use of pragmas;
> look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
> CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
> arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.
> 

https://godbolt.org/z/1oo9M6

With the __restrict__ keywords added, clang seems to vectorize the loop,
but still reports that vectorization wasn't beneficial -- any idea
what's going on?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-10 22:15               ` Arvind Sankar
  0 siblings, 0 replies; 56+ messages in thread
From: Arvind Sankar @ 2020-11-10 22:15 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Arnd Bergmann, Adrian Ratiu, Russell King, LKML,
	clang-built-linux, Nathan Chancellor, Collabora Kernel ML,
	Ard Biesheuvel, Linux ARM

On Tue, Nov 10, 2020 at 01:41:17PM -0800, Nick Desaulniers wrote:
> On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
> >
> > On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> > wrote:
> > > +#pragma clang loop vectorize(enable)
> > >         do {
> > >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> > >                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > > ``` seems to generate the vectorized code.
> > >
> > > Why don't we find a way to make those pragma's more toolchain
> > > portable, rather than open coding them like I have above rather
> > > than this series?
> >
> > Hi again Nick,
> >
> > How did you verify the above pragmas generate correct vectorized
> > code?  Have you tested this specific use case?
> 
> I read the disassembly before and after my suggested use of pragmas;
> look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
> CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
> arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.
> 

https://godbolt.org/z/1oo9M6

With the __restrict__ keywords added, clang seems to vectorize the loop,
but still reports that vectorization wasn't beneficial -- any idea
what's going on?

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-10 22:15               ` Arvind Sankar
@ 2020-11-10 22:36                 ` Nick Desaulniers
  -1 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-10 22:36 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Adrian Ratiu, Nathan Chancellor, Arnd Bergmann, Linux ARM,
	clang-built-linux, Russell King, LKML, Collabora Kernel ML,
	Ard Biesheuvel

On Tue, Nov 10, 2020 at 2:15 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> On Tue, Nov 10, 2020 at 01:41:17PM -0800, Nick Desaulniers wrote:
> > On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
> > >
> > > On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> > > wrote:
> > > > +#pragma clang loop vectorize(enable)
> > > >         do {
> > > >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> > > >                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > > > ``` seems to generate the vectorized code.
> > > >
> > > > Why don't we find a way to make those pragma's more toolchain
> > > > portable, rather than open coding them like I have above rather
> > > > than this series?
> > >
> > > Hi again Nick,
> > >
> > > How did you verify the above pragmas generate correct vectorized
> > > code?  Have you tested this specific use case?
> >
> > I read the disassembly before and after my suggested use of pragmas;
> > look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
> > CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
> > arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.
> >
>
> https://godbolt.org/z/1oo9M6
>
> With the __restrict__ keywords added, clang seems to vectorize the loop,
> but still reports that vectorization wasn't beneficial -- any idea
> what's going on?

I suspect that loop-vectorize is a higher level pass that relies on
slp-vectorizer for the transform.

$ clang -O2 --target=arm-linux-gnueabi -S -o - foo.c -mfpu=neon -mllvm
-print-after-all
...
*** IR Dump After SLP Vectorizer ***
(bunch of <4 x i32> types)

If you add -Rpass-missed=slp-vectorizer, observe that the existing
warnings from -Rpass-missed=loop-vectorize disappear; I suspect
loop-vectorize will print a "remark" if passes it calls did not, but
returned some for of error code.

-Rpass=slp-vectorizer shows that it vectorizes two sequences of the
loop, and warns that some third portion (that's
non-immediately-obvious to me) was non beneficial.

-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-10 22:36                 ` Nick Desaulniers
  0 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-10 22:36 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Arnd Bergmann, Adrian Ratiu, Russell King, LKML,
	clang-built-linux, Nathan Chancellor, Collabora Kernel ML,
	Ard Biesheuvel, Linux ARM

On Tue, Nov 10, 2020 at 2:15 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> On Tue, Nov 10, 2020 at 01:41:17PM -0800, Nick Desaulniers wrote:
> > On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
> > >
> > > On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> > > wrote:
> > > > +#pragma clang loop vectorize(enable)
> > > >         do {
> > > >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> > > >                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > > > ``` seems to generate the vectorized code.
> > > >
> > > > Why don't we find a way to make those pragma's more toolchain
> > > > portable, rather than open coding them like I have above rather
> > > > than this series?
> > >
> > > Hi again Nick,
> > >
> > > How did you verify the above pragmas generate correct vectorized
> > > code?  Have you tested this specific use case?
> >
> > I read the disassembly before and after my suggested use of pragmas;
> > look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
> > CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
> > arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.
> >
>
> https://godbolt.org/z/1oo9M6
>
> With the __restrict__ keywords added, clang seems to vectorize the loop,
> but still reports that vectorization wasn't beneficial -- any idea
> what's going on?

I suspect that loop-vectorize is a higher level pass that relies on
slp-vectorizer for the transform.

$ clang -O2 --target=arm-linux-gnueabi -S -o - foo.c -mfpu=neon -mllvm
-print-after-all
...
*** IR Dump After SLP Vectorizer ***
(bunch of <4 x i32> types)

If you add -Rpass-missed=slp-vectorizer, observe that the existing
warnings from -Rpass-missed=loop-vectorize disappear; I suspect
loop-vectorize will print a "remark" if passes it calls did not, but
returned some for of error code.

-Rpass=slp-vectorizer shows that it vectorizes two sequences of the
loop, and warns that some third portion (that's
non-immediately-obvious to me) was non beneficial.

-- 
Thanks,
~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-10 22:36                 ` Nick Desaulniers
@ 2020-11-10 22:39                   ` Nick Desaulniers
  -1 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-10 22:39 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Adrian Ratiu, Nathan Chancellor, Arnd Bergmann, Linux ARM,
	clang-built-linux, Russell King, LKML, Collabora Kernel ML,
	Ard Biesheuvel

On Tue, Nov 10, 2020 at 2:36 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Tue, Nov 10, 2020 at 2:15 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> >
> > On Tue, Nov 10, 2020 at 01:41:17PM -0800, Nick Desaulniers wrote:
> > > On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
> > > >
> > > > On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> > > > wrote:
> > > > > +#pragma clang loop vectorize(enable)
> > > > >         do {
> > > > >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> > > > >                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > > > > ``` seems to generate the vectorized code.
> > > > >
> > > > > Why don't we find a way to make those pragma's more toolchain
> > > > > portable, rather than open coding them like I have above rather
> > > > > than this series?
> > > >
> > > > Hi again Nick,
> > > >
> > > > How did you verify the above pragmas generate correct vectorized
> > > > code?  Have you tested this specific use case?
> > >
> > > I read the disassembly before and after my suggested use of pragmas;
> > > look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
> > > CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
> > > arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.
> > >
> >
> > https://godbolt.org/z/1oo9M6
> >
> > With the __restrict__ keywords added, clang seems to vectorize the loop,
> > but still reports that vectorization wasn't beneficial -- any idea
> > what's going on?

Anyways, it's not safe to make that change in the kernel unless you
can guarantee that callers of these routines do not alias or overlap.

>
> I suspect that loop-vectorize is a higher level pass that relies on
> slp-vectorizer for the transform.
>
> $ clang -O2 --target=arm-linux-gnueabi -S -o - foo.c -mfpu=neon -mllvm
> -print-after-all
> ...
> *** IR Dump After SLP Vectorizer ***
> (bunch of <4 x i32> types)
>
> If you add -Rpass-missed=slp-vectorizer, observe that the existing
> warnings from -Rpass-missed=loop-vectorize disappear; I suspect
> loop-vectorize will print a "remark" if passes it calls did not, but
> returned some for of error code.
>
> -Rpass=slp-vectorizer shows that it vectorizes two sequences of the
> loop, and warns that some third portion (that's
> non-immediately-obvious to me) was non beneficial.
>
> --
> Thanks,
> ~Nick Desaulniers



-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-10 22:39                   ` Nick Desaulniers
  0 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-10 22:39 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Arnd Bergmann, Adrian Ratiu, Russell King, LKML,
	clang-built-linux, Nathan Chancellor, Collabora Kernel ML,
	Ard Biesheuvel, Linux ARM

On Tue, Nov 10, 2020 at 2:36 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Tue, Nov 10, 2020 at 2:15 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> >
> > On Tue, Nov 10, 2020 at 01:41:17PM -0800, Nick Desaulniers wrote:
> > > On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
> > > >
> > > > On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> > > > wrote:
> > > > > +#pragma clang loop vectorize(enable)
> > > > >         do {
> > > > >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> > > > >                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > > > > ``` seems to generate the vectorized code.
> > > > >
> > > > > Why don't we find a way to make those pragma's more toolchain
> > > > > portable, rather than open coding them like I have above rather
> > > > > than this series?
> > > >
> > > > Hi again Nick,
> > > >
> > > > How did you verify the above pragmas generate correct vectorized
> > > > code?  Have you tested this specific use case?
> > >
> > > I read the disassembly before and after my suggested use of pragmas;
> > > look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
> > > CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
> > > arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.
> > >
> >
> > https://godbolt.org/z/1oo9M6
> >
> > With the __restrict__ keywords added, clang seems to vectorize the loop,
> > but still reports that vectorization wasn't beneficial -- any idea
> > what's going on?

Anyways, it's not safe to make that change in the kernel unless you
can guarantee that callers of these routines do not alias or overlap.

>
> I suspect that loop-vectorize is a higher level pass that relies on
> slp-vectorizer for the transform.
>
> $ clang -O2 --target=arm-linux-gnueabi -S -o - foo.c -mfpu=neon -mllvm
> -print-after-all
> ...
> *** IR Dump After SLP Vectorizer ***
> (bunch of <4 x i32> types)
>
> If you add -Rpass-missed=slp-vectorizer, observe that the existing
> warnings from -Rpass-missed=loop-vectorize disappear; I suspect
> loop-vectorize will print a "remark" if passes it calls did not, but
> returned some for of error code.
>
> -Rpass=slp-vectorizer shows that it vectorizes two sequences of the
> loop, and warns that some third portion (that's
> non-immediately-obvious to me) was non beneficial.
>
> --
> Thanks,
> ~Nick Desaulniers



-- 
Thanks,
~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-10 22:39                   ` Nick Desaulniers
@ 2020-11-10 22:39                     ` Nick Desaulniers
  -1 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-10 22:39 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Adrian Ratiu, Nathan Chancellor, Arnd Bergmann, Linux ARM,
	clang-built-linux, Russell King, LKML, Collabora Kernel ML,
	Ard Biesheuvel

On Tue, Nov 10, 2020 at 2:39 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Tue, Nov 10, 2020 at 2:36 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > On Tue, Nov 10, 2020 at 2:15 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> > >
> > > On Tue, Nov 10, 2020 at 01:41:17PM -0800, Nick Desaulniers wrote:
> > > > On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
> > > > >
> > > > > On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> > > > > wrote:
> > > > > > +#pragma clang loop vectorize(enable)
> > > > > >         do {
> > > > > >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> > > > > >                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > > > > > ``` seems to generate the vectorized code.
> > > > > >
> > > > > > Why don't we find a way to make those pragma's more toolchain
> > > > > > portable, rather than open coding them like I have above rather
> > > > > > than this series?
> > > > >
> > > > > Hi again Nick,
> > > > >
> > > > > How did you verify the above pragmas generate correct vectorized
> > > > > code?  Have you tested this specific use case?
> > > >
> > > > I read the disassembly before and after my suggested use of pragmas;
> > > > look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
> > > > CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
> > > > arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.
> > > >
> > >
> > > https://godbolt.org/z/1oo9M6
> > >
> > > With the __restrict__ keywords added, clang seems to vectorize the loop,
> > > but still reports that vectorization wasn't beneficial -- any idea
> > > what's going on?
>
> Anyways, it's not safe to make that change in the kernel unless you
> can guarantee that callers of these routines do not alias or overlap.

s/callers/parameters passed by callers/

>
> >
> > I suspect that loop-vectorize is a higher level pass that relies on
> > slp-vectorizer for the transform.
> >
> > $ clang -O2 --target=arm-linux-gnueabi -S -o - foo.c -mfpu=neon -mllvm
> > -print-after-all
> > ...
> > *** IR Dump After SLP Vectorizer ***
> > (bunch of <4 x i32> types)
> >
> > If you add -Rpass-missed=slp-vectorizer, observe that the existing
> > warnings from -Rpass-missed=loop-vectorize disappear; I suspect
> > loop-vectorize will print a "remark" if passes it calls did not, but
> > returned some for of error code.
> >
> > -Rpass=slp-vectorizer shows that it vectorizes two sequences of the
> > loop, and warns that some third portion (that's
> > non-immediately-obvious to me) was non beneficial.
> >
> > --
> > Thanks,
> > ~Nick Desaulniers
>
>
>
> --
> Thanks,
> ~Nick Desaulniers



-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-10 22:39                     ` Nick Desaulniers
  0 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-10 22:39 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Arnd Bergmann, Adrian Ratiu, Russell King, LKML,
	clang-built-linux, Nathan Chancellor, Collabora Kernel ML,
	Ard Biesheuvel, Linux ARM

On Tue, Nov 10, 2020 at 2:39 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Tue, Nov 10, 2020 at 2:36 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > On Tue, Nov 10, 2020 at 2:15 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> > >
> > > On Tue, Nov 10, 2020 at 01:41:17PM -0800, Nick Desaulniers wrote:
> > > > On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
> > > > >
> > > > > On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> > > > > wrote:
> > > > > > +#pragma clang loop vectorize(enable)
> > > > > >         do {
> > > > > >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> > > > > >                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > > > > > ``` seems to generate the vectorized code.
> > > > > >
> > > > > > Why don't we find a way to make those pragma's more toolchain
> > > > > > portable, rather than open coding them like I have above rather
> > > > > > than this series?
> > > > >
> > > > > Hi again Nick,
> > > > >
> > > > > How did you verify the above pragmas generate correct vectorized
> > > > > code?  Have you tested this specific use case?
> > > >
> > > > I read the disassembly before and after my suggested use of pragmas;
> > > > look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
> > > > CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
> > > > arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.
> > > >
> > >
> > > https://godbolt.org/z/1oo9M6
> > >
> > > With the __restrict__ keywords added, clang seems to vectorize the loop,
> > > but still reports that vectorization wasn't beneficial -- any idea
> > > what's going on?
>
> Anyways, it's not safe to make that change in the kernel unless you
> can guarantee that callers of these routines do not alias or overlap.

s/callers/parameters passed by callers/

>
> >
> > I suspect that loop-vectorize is a higher level pass that relies on
> > slp-vectorizer for the transform.
> >
> > $ clang -O2 --target=arm-linux-gnueabi -S -o - foo.c -mfpu=neon -mllvm
> > -print-after-all
> > ...
> > *** IR Dump After SLP Vectorizer ***
> > (bunch of <4 x i32> types)
> >
> > If you add -Rpass-missed=slp-vectorizer, observe that the existing
> > warnings from -Rpass-missed=loop-vectorize disappear; I suspect
> > loop-vectorize will print a "remark" if passes it calls did not, but
> > returned some for of error code.
> >
> > -Rpass=slp-vectorizer shows that it vectorizes two sequences of the
> > loop, and warns that some third portion (that's
> > non-immediately-obvious to me) was non beneficial.
> >
> > --
> > Thanks,
> > ~Nick Desaulniers
>
>
>
> --
> Thanks,
> ~Nick Desaulniers



-- 
Thanks,
~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-10 22:39                     ` Nick Desaulniers
@ 2020-11-10 22:54                       ` Arvind Sankar
  -1 siblings, 0 replies; 56+ messages in thread
From: Arvind Sankar @ 2020-11-10 22:54 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Arvind Sankar, Adrian Ratiu, Nathan Chancellor, Arnd Bergmann,
	Linux ARM, clang-built-linux, Russell King, LKML,
	Collabora Kernel ML, Ard Biesheuvel

On Tue, Nov 10, 2020 at 02:39:59PM -0800, Nick Desaulniers wrote:
> On Tue, Nov 10, 2020 at 2:39 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > On Tue, Nov 10, 2020 at 2:36 PM Nick Desaulniers
> > <ndesaulniers@google.com> wrote:
> > >
> > > On Tue, Nov 10, 2020 at 2:15 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> > > >
> > > > On Tue, Nov 10, 2020 at 01:41:17PM -0800, Nick Desaulniers wrote:
> > > > > On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
> > > > > >
> > > > > > On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> > > > > > wrote:
> > > > > > > +#pragma clang loop vectorize(enable)
> > > > > > >         do {
> > > > > > >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> > > > > > >                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > > > > > > ``` seems to generate the vectorized code.
> > > > > > >
> > > > > > > Why don't we find a way to make those pragma's more toolchain
> > > > > > > portable, rather than open coding them like I have above rather
> > > > > > > than this series?
> > > > > >
> > > > > > Hi again Nick,
> > > > > >
> > > > > > How did you verify the above pragmas generate correct vectorized
> > > > > > code?  Have you tested this specific use case?
> > > > >
> > > > > I read the disassembly before and after my suggested use of pragmas;
> > > > > look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
> > > > > CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
> > > > > arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.
> > > > >
> > > >
> > > > https://godbolt.org/z/1oo9M6
> > > >
> > > > With the __restrict__ keywords added, clang seems to vectorize the loop,
> > > > but still reports that vectorization wasn't beneficial -- any idea
> > > > what's going on?
> >
> > Anyways, it's not safe to make that change in the kernel unless you
> > can guarantee that callers of these routines do not alias or overlap.
> 
> s/callers/parameters passed by callers/
> 

Yep, but that seems likely, it doesn't seem like the function would do
anything useful if the destination overlapped one of the sources. The
kernel just doesn't seem to make use of the restrict keyword.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-10 22:54                       ` Arvind Sankar
  0 siblings, 0 replies; 56+ messages in thread
From: Arvind Sankar @ 2020-11-10 22:54 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Arnd Bergmann, Adrian Ratiu, Russell King, LKML,
	clang-built-linux, Arvind Sankar, Nathan Chancellor,
	Collabora Kernel ML, Ard Biesheuvel, Linux ARM

On Tue, Nov 10, 2020 at 02:39:59PM -0800, Nick Desaulniers wrote:
> On Tue, Nov 10, 2020 at 2:39 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > On Tue, Nov 10, 2020 at 2:36 PM Nick Desaulniers
> > <ndesaulniers@google.com> wrote:
> > >
> > > On Tue, Nov 10, 2020 at 2:15 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> > > >
> > > > On Tue, Nov 10, 2020 at 01:41:17PM -0800, Nick Desaulniers wrote:
> > > > > On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
> > > > > >
> > > > > > On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> > > > > > wrote:
> > > > > > > +#pragma clang loop vectorize(enable)
> > > > > > >         do {
> > > > > > >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> > > > > > >                 p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > > > > > > ``` seems to generate the vectorized code.
> > > > > > >
> > > > > > > Why don't we find a way to make those pragma's more toolchain
> > > > > > > portable, rather than open coding them like I have above rather
> > > > > > > than this series?
> > > > > >
> > > > > > Hi again Nick,
> > > > > >
> > > > > > How did you verify the above pragmas generate correct vectorized
> > > > > > code?  Have you tested this specific use case?
> > > > >
> > > > > I read the disassembly before and after my suggested use of pragmas;
> > > > > look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
> > > > > CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
> > > > > arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.
> > > > >
> > > >
> > > > https://godbolt.org/z/1oo9M6
> > > >
> > > > With the __restrict__ keywords added, clang seems to vectorize the loop,
> > > > but still reports that vectorization wasn't beneficial -- any idea
> > > > what's going on?
> >
> > Anyways, it's not safe to make that change in the kernel unless you
> > can guarantee that callers of these routines do not alias or overlap.
> 
> s/callers/parameters passed by callers/
> 

Yep, but that seems likely, it doesn't seem like the function would do
anything useful if the destination overlapped one of the sources. The
kernel just doesn't seem to make use of the restrict keyword.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-10 21:41             ` Nick Desaulniers
@ 2020-11-10 23:56               ` Adrian Ratiu
  -1 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-10 23:56 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Nathan Chancellor, Arnd Bergmann, Linux ARM, clang-built-linux,
	Russell King, LKML, Collabora Kernel ML, Ard Biesheuvel

On Tue, 10 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
wrote:
> On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu 
> <adrian.ratiu@collabora.com> wrote: 
>> 
>> On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
>> wrote: 
>> > +#pragma clang loop vectorize(enable) 
>> >         do { 
>> >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] 
>> >                 ^= p2[1] ^ p3[1] ^ p4[1] ^ p5[1]; 
>> > ``` seems to generate the vectorized code. 
>> > 
>> > Why don't we find a way to make those pragma's more toolchain 
>> > portable, rather than open coding them like I have above 
>> > rather than this series? 
>> 
>> Hi again Nick, 
>> 
>> How did you verify the above pragmas generate correct 
>> vectorized code?  Have you tested this specific use case? 
> 
> I read the disassembly before and after my suggested use of 
> pragmas; look for vld/vstr.  You can also add 
> -Rpass-missed=loop-vectorize to CFLAGS_xor-neon.o in 
> arch/arm/lib/Makefile and rebuild arch/arm/lib/xor-neon.o with 
> CONFIG_BTRFS enabled. 
> 
>> 
>> I'm asking because overrulling the cost model might not be 
>> enough, the only thing I can confirm is that the generated code 
>> is changed, but not that it is correct in any way. The object 
>> disasm also looks weird, but I don't have enough knowledge to 
>> start debugging what's happening within LLVM/Clang itself. 
> 
> It doesn't "look weird" to me. The loop is versioned based on a 
> comparison whether the parameters alias or not. There's a 
> non-vectorized version if the parameters are equal or close 
> enough to overlap.  There's another version of the loop that's 
> vectorized.  If you want just the vectorized version, then you 
> have to mark the parameters as __restrict qualified, then check 
> that all callers are ok with that. 
> 

Thank you for the explanation, that does make sense now. I'm just 
a compiler optimization noob, sorry. All your help is much 
appreciated.

>> 
>> I also get some new warnings with your code [1], besides the 
>> previously 'vectorization was possible but not beneficial' 
>> which is still present. It is quite funny because these two 
>> warnings seem to contradict themselves. :) 
> 
> From which compiler?  ``` $ clang 
> -Wpass-failed=transform-warning -c -x c /dev/null warning: 
> unknown warning option '-Wpass-failed=transform-warning'; did 
> you mean '-Wprofile-instr-missing'? [-Wunknown-warning-option] 
> ``` 

I'm using Clang 10.0.1-1 from the Arch Linux repo.

In the LLVM sources that transform-warning appears to be 
documented under 
llvm-10.0.1.src/docs/Passes.rst:1227:-transform-warning

Here's a build log: http://ix.io/2DIc

I always get those warnings with the pragma change you suggested, 
even on clean builds on latest linux-next.

I looked at the Arch PKGBUILD and they don't appear to do anything 
special other than patching to enable SSP and PIE by default (eg 
llvm bug 13410).

> 
> The pragma is clang specific, hence my recommendation to wrap it 
> in an #ifdef __clang__. 
>

Yes, I understand that. :)
 
>> 
>> At this point I do not trust the compiler and am inclined to do 
> 
> Nonsense. 
> 
>> like was done for GCC when it was broken: disable the 
>> optimization and warn users to upgrade after the compiler is 
>> fixed and confirmed to work. 
>> 
>> If you agree I can send a v2 with this and also drop the GCC 
>> pragma as Arvind and Ard suggested. 
> 
> If you resend "this" as in 2/2, I will NACK it.  There's nothing 
> wrong with the cost model; it's saying there's little point in 
> generating the vectorized version because you're still going to 
> need a non-vectorized loop version anyways.  Claiming there is a 
> compiler bug here is dubious just because the cost models 
> between two compilers differ slightly.

Ok, so that "remark" from the compiler is safe to ignore.

> 
> Resend the patch removing the warning, remove the GCC pragma, 
> but if you want to change anything here for Clang, use `#pragma 
> clang loop vectorize(enable)` wrapped in an `#ifdef __clang__`. 
>

Thanks for making the NACK clear, so the way forward is to either 
use the pragma if I can figure out the new 'loop not vectorized' 
warning (which might also be a red herring) or just leave Clang as 
is. :)

>>
>> Kind regards,
>> Adrian
>>
>> [1]
>> ./include/asm-generic/xor.h:11:1: warning: loop not vectorized:
>> the optimizer was unable to perform the requested transformation;
>> the transformation might be disabled or specified as part of an
>> unsupported transformation ordering
>> [-Wpass-failed=transform-warning] xor_8regs_2(unsigned long bytes,
>> unsigned long *p1, unsigned long *p2)
>
>
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-10 23:56               ` Adrian Ratiu
  0 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-10 23:56 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Arnd Bergmann, LKML, Russell King, clang-built-linux,
	Nathan Chancellor, Collabora Kernel ML, Ard Biesheuvel,
	Linux ARM

On Tue, 10 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
wrote:
> On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu 
> <adrian.ratiu@collabora.com> wrote: 
>> 
>> On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
>> wrote: 
>> > +#pragma clang loop vectorize(enable) 
>> >         do { 
>> >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] 
>> >                 ^= p2[1] ^ p3[1] ^ p4[1] ^ p5[1]; 
>> > ``` seems to generate the vectorized code. 
>> > 
>> > Why don't we find a way to make those pragma's more toolchain 
>> > portable, rather than open coding them like I have above 
>> > rather than this series? 
>> 
>> Hi again Nick, 
>> 
>> How did you verify the above pragmas generate correct 
>> vectorized code?  Have you tested this specific use case? 
> 
> I read the disassembly before and after my suggested use of 
> pragmas; look for vld/vstr.  You can also add 
> -Rpass-missed=loop-vectorize to CFLAGS_xor-neon.o in 
> arch/arm/lib/Makefile and rebuild arch/arm/lib/xor-neon.o with 
> CONFIG_BTRFS enabled. 
> 
>> 
>> I'm asking because overrulling the cost model might not be 
>> enough, the only thing I can confirm is that the generated code 
>> is changed, but not that it is correct in any way. The object 
>> disasm also looks weird, but I don't have enough knowledge to 
>> start debugging what's happening within LLVM/Clang itself. 
> 
> It doesn't "look weird" to me. The loop is versioned based on a 
> comparison whether the parameters alias or not. There's a 
> non-vectorized version if the parameters are equal or close 
> enough to overlap.  There's another version of the loop that's 
> vectorized.  If you want just the vectorized version, then you 
> have to mark the parameters as __restrict qualified, then check 
> that all callers are ok with that. 
> 

Thank you for the explanation, that does make sense now. I'm just 
a compiler optimization noob, sorry. All your help is much 
appreciated.

>> 
>> I also get some new warnings with your code [1], besides the 
>> previously 'vectorization was possible but not beneficial' 
>> which is still present. It is quite funny because these two 
>> warnings seem to contradict themselves. :) 
> 
> From which compiler?  ``` $ clang 
> -Wpass-failed=transform-warning -c -x c /dev/null warning: 
> unknown warning option '-Wpass-failed=transform-warning'; did 
> you mean '-Wprofile-instr-missing'? [-Wunknown-warning-option] 
> ``` 

I'm using Clang 10.0.1-1 from the Arch Linux repo.

In the LLVM sources that transform-warning appears to be 
documented under 
llvm-10.0.1.src/docs/Passes.rst:1227:-transform-warning

Here's a build log: http://ix.io/2DIc

I always get those warnings with the pragma change you suggested, 
even on clean builds on latest linux-next.

I looked at the Arch PKGBUILD and they don't appear to do anything 
special other than patching to enable SSP and PIE by default (eg 
llvm bug 13410).

> 
> The pragma is clang specific, hence my recommendation to wrap it 
> in an #ifdef __clang__. 
>

Yes, I understand that. :)
 
>> 
>> At this point I do not trust the compiler and am inclined to do 
> 
> Nonsense. 
> 
>> like was done for GCC when it was broken: disable the 
>> optimization and warn users to upgrade after the compiler is 
>> fixed and confirmed to work. 
>> 
>> If you agree I can send a v2 with this and also drop the GCC 
>> pragma as Arvind and Ard suggested. 
> 
> If you resend "this" as in 2/2, I will NACK it.  There's nothing 
> wrong with the cost model; it's saying there's little point in 
> generating the vectorized version because you're still going to 
> need a non-vectorized loop version anyways.  Claiming there is a 
> compiler bug here is dubious just because the cost models 
> between two compilers differ slightly.

Ok, so that "remark" from the compiler is safe to ignore.

> 
> Resend the patch removing the warning, remove the GCC pragma, 
> but if you want to change anything here for Clang, use `#pragma 
> clang loop vectorize(enable)` wrapped in an `#ifdef __clang__`. 
>

Thanks for making the NACK clear, so the way forward is to either 
use the pragma if I can figure out the new 'loop not vectorized' 
warning (which might also be a red herring) or just leave Clang as 
is. :)

>>
>> Kind regards,
>> Adrian
>>
>> [1]
>> ./include/asm-generic/xor.h:11:1: warning: loop not vectorized:
>> the optimizer was unable to perform the requested transformation;
>> the transformation might be disabled or specified as part of an
>> unsupported transformation ordering
>> [-Wpass-failed=transform-warning] xor_8regs_2(unsigned long bytes,
>> unsigned long *p1, unsigned long *p2)
>
>
> -- 
> Thanks,
> ~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-10 23:56               ` Adrian Ratiu
@ 2020-11-11  0:18                 ` Nick Desaulniers
  -1 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-11  0:18 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Nathan Chancellor, Arnd Bergmann, Linux ARM, clang-built-linux,
	Russell King, LKML, Collabora Kernel ML, Ard Biesheuvel

On Tue, Nov 10, 2020 at 3:54 PM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
>
> On Tue, 10 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> wrote:
> > On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu
> > <adrian.ratiu@collabora.com> wrote:
> >>
> >> On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> >> wrote:
> >> > +#pragma clang loop vectorize(enable)
> >> >         do {
> >> >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1]
> >> >                 ^= p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> >> > ``` seems to generate the vectorized code.
> >> >
> >> > Why don't we find a way to make those pragma's more toolchain
> >> > portable, rather than open coding them like I have above
> >> > rather than this series?
> >>
> >> Hi again Nick,
> >>
> >> How did you verify the above pragmas generate correct
> >> vectorized code?  Have you tested this specific use case?
> >
> > I read the disassembly before and after my suggested use of
> > pragmas; look for vld/vstr.  You can also add
> > -Rpass-missed=loop-vectorize to CFLAGS_xor-neon.o in
> > arch/arm/lib/Makefile and rebuild arch/arm/lib/xor-neon.o with
> > CONFIG_BTRFS enabled.
> >
> >>
> >> I'm asking because overrulling the cost model might not be
> >> enough, the only thing I can confirm is that the generated code
> >> is changed, but not that it is correct in any way. The object
> >> disasm also looks weird, but I don't have enough knowledge to
> >> start debugging what's happening within LLVM/Clang itself.
> >
> > It doesn't "look weird" to me. The loop is versioned based on a
> > comparison whether the parameters alias or not. There's a
> > non-vectorized version if the parameters are equal or close
> > enough to overlap.  There's another version of the loop that's
> > vectorized.  If you want just the vectorized version, then you
> > have to mark the parameters as __restrict qualified, then check
> > that all callers are ok with that.
> >
>
> Thank you for the explanation, that does make sense now. I'm just
> a compiler optimization noob, sorry. All your help is much
> appreciated.

Don't worry about it; you'll get the hang of it in no time, just stick with it.

>
> >>
> >> I also get some new warnings with your code [1], besides the
> >> previously 'vectorization was possible but not beneficial'
> >> which is still present. It is quite funny because these two
> >> warnings seem to contradict themselves. :)
> >
> > From which compiler?  ``` $ clang
> > -Wpass-failed=transform-warning -c -x c /dev/null warning:
> > unknown warning option '-Wpass-failed=transform-warning'; did
> > you mean '-Wprofile-instr-missing'? [-Wunknown-warning-option]
> > ```
>
> I'm using Clang 10.0.1-1 from the Arch Linux repo.
>
> In the LLVM sources that transform-warning appears to be
> documented under
> llvm-10.0.1.src/docs/Passes.rst:1227:-transform-warning
>
> Here's a build log: http://ix.io/2DIc
>
> I always get those warnings with the pragma change you suggested,
> even on clean builds on latest linux-next.
>
> I looked at the Arch PKGBUILD and they don't appear to do anything
> special other than patching to enable SSP and PIE by default (eg
> llvm bug 13410).

Ah, custom builds of LLVM.  Grepping for transform-warning in LLVM's
sources, I can indeed see such a pass. I'm curious whether Arch is
turning on that pass by default or if you manually enabled
-Wpass-failed=transform-warning in the Makefile?  Maybe I need to do
an assertions enabled build of LLVM or a debug build. Reading through
llvm/docs/Passes.rst and llvm/docs/TransformMetadata.rst, it sounds
like this should be triggered when a "forced optimization has failed."
So I wonder what's the missing variable between it working for me, vs
warning for you?

Godbolt seems to agree with me here: https://godbolt.org/z/Wf6YKv.
Maybe related to the "New Pass Manager" ... digging into that...

>
> >
> > The pragma is clang specific, hence my recommendation to wrap it
> > in an #ifdef __clang__.
> >
>
> Yes, I understand that. :)
>
> >>
> >> At this point I do not trust the compiler and am inclined to do
> >
> > Nonsense.
> >
> >> like was done for GCC when it was broken: disable the
> >> optimization and warn users to upgrade after the compiler is
> >> fixed and confirmed to work.
> >>
> >> If you agree I can send a v2 with this and also drop the GCC
> >> pragma as Arvind and Ard suggested.
> >
> > If you resend "this" as in 2/2, I will NACK it.  There's nothing
> > wrong with the cost model; it's saying there's little point in
> > generating the vectorized version because you're still going to
> > need a non-vectorized loop version anyways.  Claiming there is a
> > compiler bug here is dubious just because the cost models
> > between two compilers differ slightly.
>
> Ok, so that "remark" from the compiler is safe to ignore.

Are you always seeing it when building with the pragma's added, no
change to CFLAGS_xor-neon.o in arch/arm/lib/Makefile?

>
> >
> > Resend the patch removing the warning, remove the GCC pragma,
> > but if you want to change anything here for Clang, use `#pragma
> > clang loop vectorize(enable)` wrapped in an `#ifdef __clang__`.
> >
>
> Thanks for making the NACK clear, so the way forward is to either
> use the pragma if I can figure out the new 'loop not vectorized'
> warning (which might also be a red herring) or just leave Clang as
> is. :)

Yes, though additionally Arvind points out that this code is kind of
curious if there was overlap; maybe the parameters should just be
restrict-qualified.

>
> >>
> >> Kind regards,
> >> Adrian
> >>
> >> [1]
> >> ./include/asm-generic/xor.h:11:1: warning: loop not vectorized:
> >> the optimizer was unable to perform the requested transformation;
> >> the transformation might be disabled or specified as part of an
> >> unsupported transformation ordering
> >> [-Wpass-failed=transform-warning] xor_8regs_2(unsigned long bytes,
> >> unsigned long *p1, unsigned long *p2)

-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-11  0:18                 ` Nick Desaulniers
  0 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-11  0:18 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Arnd Bergmann, LKML, Russell King, clang-built-linux,
	Nathan Chancellor, Collabora Kernel ML, Ard Biesheuvel,
	Linux ARM

On Tue, Nov 10, 2020 at 3:54 PM Adrian Ratiu <adrian.ratiu@collabora.com> wrote:
>
> On Tue, 10 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> wrote:
> > On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu
> > <adrian.ratiu@collabora.com> wrote:
> >>
> >> On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> >> wrote:
> >> > +#pragma clang loop vectorize(enable)
> >> >         do {
> >> >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1]
> >> >                 ^= p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> >> > ``` seems to generate the vectorized code.
> >> >
> >> > Why don't we find a way to make those pragma's more toolchain
> >> > portable, rather than open coding them like I have above
> >> > rather than this series?
> >>
> >> Hi again Nick,
> >>
> >> How did you verify the above pragmas generate correct
> >> vectorized code?  Have you tested this specific use case?
> >
> > I read the disassembly before and after my suggested use of
> > pragmas; look for vld/vstr.  You can also add
> > -Rpass-missed=loop-vectorize to CFLAGS_xor-neon.o in
> > arch/arm/lib/Makefile and rebuild arch/arm/lib/xor-neon.o with
> > CONFIG_BTRFS enabled.
> >
> >>
> >> I'm asking because overrulling the cost model might not be
> >> enough, the only thing I can confirm is that the generated code
> >> is changed, but not that it is correct in any way. The object
> >> disasm also looks weird, but I don't have enough knowledge to
> >> start debugging what's happening within LLVM/Clang itself.
> >
> > It doesn't "look weird" to me. The loop is versioned based on a
> > comparison whether the parameters alias or not. There's a
> > non-vectorized version if the parameters are equal or close
> > enough to overlap.  There's another version of the loop that's
> > vectorized.  If you want just the vectorized version, then you
> > have to mark the parameters as __restrict qualified, then check
> > that all callers are ok with that.
> >
>
> Thank you for the explanation, that does make sense now. I'm just
> a compiler optimization noob, sorry. All your help is much
> appreciated.

Don't worry about it; you'll get the hang of it in no time, just stick with it.

>
> >>
> >> I also get some new warnings with your code [1], besides the
> >> previously 'vectorization was possible but not beneficial'
> >> which is still present. It is quite funny because these two
> >> warnings seem to contradict themselves. :)
> >
> > From which compiler?  ``` $ clang
> > -Wpass-failed=transform-warning -c -x c /dev/null warning:
> > unknown warning option '-Wpass-failed=transform-warning'; did
> > you mean '-Wprofile-instr-missing'? [-Wunknown-warning-option]
> > ```
>
> I'm using Clang 10.0.1-1 from the Arch Linux repo.
>
> In the LLVM sources that transform-warning appears to be
> documented under
> llvm-10.0.1.src/docs/Passes.rst:1227:-transform-warning
>
> Here's a build log: http://ix.io/2DIc
>
> I always get those warnings with the pragma change you suggested,
> even on clean builds on latest linux-next.
>
> I looked at the Arch PKGBUILD and they don't appear to do anything
> special other than patching to enable SSP and PIE by default (eg
> llvm bug 13410).

Ah, custom builds of LLVM.  Grepping for transform-warning in LLVM's
sources, I can indeed see such a pass. I'm curious whether Arch is
turning on that pass by default or if you manually enabled
-Wpass-failed=transform-warning in the Makefile?  Maybe I need to do
an assertions enabled build of LLVM or a debug build. Reading through
llvm/docs/Passes.rst and llvm/docs/TransformMetadata.rst, it sounds
like this should be triggered when a "forced optimization has failed."
So I wonder what's the missing variable between it working for me, vs
warning for you?

Godbolt seems to agree with me here: https://godbolt.org/z/Wf6YKv.
Maybe related to the "New Pass Manager" ... digging into that...

>
> >
> > The pragma is clang specific, hence my recommendation to wrap it
> > in an #ifdef __clang__.
> >
>
> Yes, I understand that. :)
>
> >>
> >> At this point I do not trust the compiler and am inclined to do
> >
> > Nonsense.
> >
> >> like was done for GCC when it was broken: disable the
> >> optimization and warn users to upgrade after the compiler is
> >> fixed and confirmed to work.
> >>
> >> If you agree I can send a v2 with this and also drop the GCC
> >> pragma as Arvind and Ard suggested.
> >
> > If you resend "this" as in 2/2, I will NACK it.  There's nothing
> > wrong with the cost model; it's saying there's little point in
> > generating the vectorized version because you're still going to
> > need a non-vectorized loop version anyways.  Claiming there is a
> > compiler bug here is dubious just because the cost models
> > between two compilers differ slightly.
>
> Ok, so that "remark" from the compiler is safe to ignore.

Are you always seeing it when building with the pragma's added, no
change to CFLAGS_xor-neon.o in arch/arm/lib/Makefile?

>
> >
> > Resend the patch removing the warning, remove the GCC pragma,
> > but if you want to change anything here for Clang, use `#pragma
> > clang loop vectorize(enable)` wrapped in an `#ifdef __clang__`.
> >
>
> Thanks for making the NACK clear, so the way forward is to either
> use the pragma if I can figure out the new 'loop not vectorized'
> warning (which might also be a red herring) or just leave Clang as
> is. :)

Yes, though additionally Arvind points out that this code is kind of
curious if there was overlap; maybe the parameters should just be
restrict-qualified.

>
> >>
> >> Kind regards,
> >> Adrian
> >>
> >> [1]
> >> ./include/asm-generic/xor.h:11:1: warning: loop not vectorized:
> >> the optimizer was unable to perform the requested transformation;
> >> the transformation might be disabled or specified as part of an
> >> unsupported transformation ordering
> >> [-Wpass-failed=transform-warning] xor_8regs_2(unsigned long bytes,
> >> unsigned long *p1, unsigned long *p2)

-- 
Thanks,
~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-11  0:18                 ` Nick Desaulniers
@ 2020-11-11 14:15                   ` Adrian Ratiu
  -1 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-11 14:15 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Nathan Chancellor, Arnd Bergmann, Linux ARM, clang-built-linux,
	Russell King, LKML, Collabora Kernel ML, Ard Biesheuvel

On Tue, 10 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
wrote:
> On Tue, Nov 10, 2020 at 3:54 PM Adrian Ratiu 
> <adrian.ratiu@collabora.com> wrote: 
>> 
>> On Tue, 10 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
>> wrote: 
>> > On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu 
>> > <adrian.ratiu@collabora.com> wrote: 
>> >> 
>> >> On Fri, 06 Nov 2020, Nick Desaulniers 
>> >> <ndesaulniers@google.com> wrote: 
>> >> > +#pragma clang loop vectorize(enable) 
>> >> >         do { 
>> >> >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; 
>> >> >                 p1[1] ^= p2[1] ^ p3[1] ^ p4[1] ^ p5[1]; 
>> >> > ``` seems to generate the vectorized code. 
>> >> > 
>> >> > Why don't we find a way to make those pragma's more 
>> >> > toolchain portable, rather than open coding them like I 
>> >> > have above rather than this series? 
>> >> 
>> >> Hi again Nick, 
>> >> 
>> >> How did you verify the above pragmas generate correct 
>> >> vectorized code?  Have you tested this specific use case? 
>> > 
>> > I read the disassembly before and after my suggested use of 
>> > pragmas; look for vld/vstr.  You can also add 
>> > -Rpass-missed=loop-vectorize to CFLAGS_xor-neon.o in 
>> > arch/arm/lib/Makefile and rebuild arch/arm/lib/xor-neon.o 
>> > with CONFIG_BTRFS enabled. 
>> > 
>> >> 
>> >> I'm asking because overrulling the cost model might not be 
>> >> enough, the only thing I can confirm is that the generated 
>> >> code is changed, but not that it is correct in any way. The 
>> >> object disasm also looks weird, but I don't have enough 
>> >> knowledge to start debugging what's happening within 
>> >> LLVM/Clang itself. 
>> > 
>> > It doesn't "look weird" to me. The loop is versioned based on 
>> > a comparison whether the parameters alias or not. There's a 
>> > non-vectorized version if the parameters are equal or close 
>> > enough to overlap.  There's another version of the loop 
>> > that's vectorized.  If you want just the vectorized version, 
>> > then you have to mark the parameters as __restrict qualified, 
>> > then check that all callers are ok with that. 
>> > 
>> 
>> Thank you for the explanation, that does make sense now. I'm 
>> just a compiler optimization noob, sorry. All your help is much 
>> appreciated. 
> 
> Don't worry about it; you'll get the hang of it in no time, just 
> stick with it. 
> 
>> 
>> >> 
>> >> I also get some new warnings with your code [1], besides the 
>> >> previously 'vectorization was possible but not beneficial' 
>> >> which is still present. It is quite funny because these two 
>> >> warnings seem to contradict themselves. :) 
>> > 
>> > From which compiler?  ``` $ clang 
>> > -Wpass-failed=transform-warning -c -x c /dev/null warning: 
>> > unknown warning option '-Wpass-failed=transform-warning'; did 
>> > you mean '-Wprofile-instr-missing'? 
>> > [-Wunknown-warning-option] ``` 
>> 
>> I'm using Clang 10.0.1-1 from the Arch Linux repo. 
>> 
>> In the LLVM sources that transform-warning appears to be 
>> documented under 
>> llvm-10.0.1.src/docs/Passes.rst:1227:-transform-warning 
>> 
>> Here's a build log: http://ix.io/2DIc 
>> 
>> I always get those warnings with the pragma change you 
>> suggested, even on clean builds on latest linux-next. 
>> 
>> I looked at the Arch PKGBUILD and they don't appear to do 
>> anything special other than patching to enable SSP and PIE by 
>> default (eg llvm bug 13410). 
> 
> Ah, custom builds of LLVM.  Grepping for transform-warning in 
> LLVM's sources, I can indeed see such a pass. I'm curious 
> whether Arch is turning on that pass by default or if you 
> manually enabled -Wpass-failed=transform-warning in the 
> Makefile?  Maybe I need to do an assertions enabled build of 
> LLVM or a debug build. Reading through llvm/docs/Passes.rst and 
> llvm/docs/TransformMetadata.rst, it sounds like this should be 
> triggered when a "forced optimization has failed."  So I wonder 
> what's the missing variable between it working for me, vs 
> warning for you?

I did not build clang myself, just did "pacman -S clang" to get 
the official distro binary package. Here's the PKGBUILD they used, 
I'm sending the commit link because recently clang 11 was upgraded 
to.

I also tested clang 11.0.0 where I get the same warnings / 
remarks.

https://github.com/archlinux/svntogit-packages/blob/8ff1bb4e4be5c6e5bede60c6b259a89f0cee6e6a/trunk/PKGBUILD
 
> 
> Godbolt seems to agree with me here: 
> https://godbolt.org/z/Wf6YKv.  Maybe related to the "New Pass 
> Manager" ... digging into that... 
> 
>> 
>> > 
>> > The pragma is clang specific, hence my recommendation to wrap 
>> > it in an #ifdef __clang__. 
>> > 
>> 
>> Yes, I understand that. :) 
>> 
>> >> 
>> >> At this point I do not trust the compiler and am inclined to 
>> >> do 
>> > 
>> > Nonsense. 
>> > 
>> >> like was done for GCC when it was broken: disable the 
>> >> optimization and warn users to upgrade after the compiler is 
>> >> fixed and confirmed to work. 
>> >> 
>> >> If you agree I can send a v2 with this and also drop the GCC 
>> >> pragma as Arvind and Ard suggested. 
>> > 
>> > If you resend "this" as in 2/2, I will NACK it.  There's 
>> > nothing wrong with the cost model; it's saying there's little 
>> > point in generating the vectorized version because you're 
>> > still going to need a non-vectorized loop version anyways. 
>> > Claiming there is a compiler bug here is dubious just because 
>> > the cost models between two compilers differ slightly. 
>> 
>> Ok, so that "remark" from the compiler is safe to ignore. 
> 
> Are you always seeing it when building with the pragma's added, 
> no change to CFLAGS_xor-neon.o in arch/arm/lib/Makefile? 
> 

No, I have to modify CFLAGS_xor-neon.o to see the remarks. If I do 
a build with just the pragma change I only always get the 
warnings, not remarks.

Here's a more complete log with -Rpass-missed='.*' in the 
Makefile, maybe the other remarks in there will help shed some 
light.

http://ix.io/2DMl

>> 
>> > 
>> > Resend the patch removing the warning, remove the GCC pragma, 
>> > but if you want to change anything here for Clang, use 
>> > `#pragma clang loop vectorize(enable)` wrapped in an `#ifdef 
>> > __clang__`. 
>> > 
>> 
>> Thanks for making the NACK clear, so the way forward is to 
>> either use the pragma if I can figure out the new 'loop not 
>> vectorized' warning (which might also be a red herring) or just 
>> leave Clang as is. :) 
> 
> Yes, though additionally Arvind points out that this code is 
> kind of curious if there was overlap; maybe the parameters 
> should just be restrict-qualified. 
>

For now I think I'll just re-send the GCC changes and leave the 
Clang optimization as is, until we better understand what's 
happening and what's the best way to enable it.

>>
>> >>
>> >> Kind regards,
>> >> Adrian
>> >>
>> >> [1]
>> >> ./include/asm-generic/xor.h:11:1: warning: loop not vectorized:
>> >> the optimizer was unable to perform the requested transformation;
>> >> the transformation might be disabled or specified as part of an
>> >> unsupported transformation ordering
>> >> [-Wpass-failed=transform-warning] xor_8regs_2(unsigned long bytes,
>> >> unsigned long *p1, unsigned long *p2)
>
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-11 14:15                   ` Adrian Ratiu
  0 siblings, 0 replies; 56+ messages in thread
From: Adrian Ratiu @ 2020-11-11 14:15 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Arnd Bergmann, LKML, Russell King, clang-built-linux,
	Nathan Chancellor, Collabora Kernel ML, Ard Biesheuvel,
	Linux ARM

On Tue, 10 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
wrote:
> On Tue, Nov 10, 2020 at 3:54 PM Adrian Ratiu 
> <adrian.ratiu@collabora.com> wrote: 
>> 
>> On Tue, 10 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
>> wrote: 
>> > On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu 
>> > <adrian.ratiu@collabora.com> wrote: 
>> >> 
>> >> On Fri, 06 Nov 2020, Nick Desaulniers 
>> >> <ndesaulniers@google.com> wrote: 
>> >> > +#pragma clang loop vectorize(enable) 
>> >> >         do { 
>> >> >                 p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; 
>> >> >                 p1[1] ^= p2[1] ^ p3[1] ^ p4[1] ^ p5[1]; 
>> >> > ``` seems to generate the vectorized code. 
>> >> > 
>> >> > Why don't we find a way to make those pragma's more 
>> >> > toolchain portable, rather than open coding them like I 
>> >> > have above rather than this series? 
>> >> 
>> >> Hi again Nick, 
>> >> 
>> >> How did you verify the above pragmas generate correct 
>> >> vectorized code?  Have you tested this specific use case? 
>> > 
>> > I read the disassembly before and after my suggested use of 
>> > pragmas; look for vld/vstr.  You can also add 
>> > -Rpass-missed=loop-vectorize to CFLAGS_xor-neon.o in 
>> > arch/arm/lib/Makefile and rebuild arch/arm/lib/xor-neon.o 
>> > with CONFIG_BTRFS enabled. 
>> > 
>> >> 
>> >> I'm asking because overrulling the cost model might not be 
>> >> enough, the only thing I can confirm is that the generated 
>> >> code is changed, but not that it is correct in any way. The 
>> >> object disasm also looks weird, but I don't have enough 
>> >> knowledge to start debugging what's happening within 
>> >> LLVM/Clang itself. 
>> > 
>> > It doesn't "look weird" to me. The loop is versioned based on 
>> > a comparison whether the parameters alias or not. There's a 
>> > non-vectorized version if the parameters are equal or close 
>> > enough to overlap.  There's another version of the loop 
>> > that's vectorized.  If you want just the vectorized version, 
>> > then you have to mark the parameters as __restrict qualified, 
>> > then check that all callers are ok with that. 
>> > 
>> 
>> Thank you for the explanation, that does make sense now. I'm 
>> just a compiler optimization noob, sorry. All your help is much 
>> appreciated. 
> 
> Don't worry about it; you'll get the hang of it in no time, just 
> stick with it. 
> 
>> 
>> >> 
>> >> I also get some new warnings with your code [1], besides the 
>> >> previously 'vectorization was possible but not beneficial' 
>> >> which is still present. It is quite funny because these two 
>> >> warnings seem to contradict themselves. :) 
>> > 
>> > From which compiler?  ``` $ clang 
>> > -Wpass-failed=transform-warning -c -x c /dev/null warning: 
>> > unknown warning option '-Wpass-failed=transform-warning'; did 
>> > you mean '-Wprofile-instr-missing'? 
>> > [-Wunknown-warning-option] ``` 
>> 
>> I'm using Clang 10.0.1-1 from the Arch Linux repo. 
>> 
>> In the LLVM sources that transform-warning appears to be 
>> documented under 
>> llvm-10.0.1.src/docs/Passes.rst:1227:-transform-warning 
>> 
>> Here's a build log: http://ix.io/2DIc 
>> 
>> I always get those warnings with the pragma change you 
>> suggested, even on clean builds on latest linux-next. 
>> 
>> I looked at the Arch PKGBUILD and they don't appear to do 
>> anything special other than patching to enable SSP and PIE by 
>> default (eg llvm bug 13410). 
> 
> Ah, custom builds of LLVM.  Grepping for transform-warning in 
> LLVM's sources, I can indeed see such a pass. I'm curious 
> whether Arch is turning on that pass by default or if you 
> manually enabled -Wpass-failed=transform-warning in the 
> Makefile?  Maybe I need to do an assertions enabled build of 
> LLVM or a debug build. Reading through llvm/docs/Passes.rst and 
> llvm/docs/TransformMetadata.rst, it sounds like this should be 
> triggered when a "forced optimization has failed."  So I wonder 
> what's the missing variable between it working for me, vs 
> warning for you?

I did not build clang myself, just did "pacman -S clang" to get 
the official distro binary package. Here's the PKGBUILD they used, 
I'm sending the commit link because recently clang 11 was upgraded 
to.

I also tested clang 11.0.0 where I get the same warnings / 
remarks.

https://github.com/archlinux/svntogit-packages/blob/8ff1bb4e4be5c6e5bede60c6b259a89f0cee6e6a/trunk/PKGBUILD
 
> 
> Godbolt seems to agree with me here: 
> https://godbolt.org/z/Wf6YKv.  Maybe related to the "New Pass 
> Manager" ... digging into that... 
> 
>> 
>> > 
>> > The pragma is clang specific, hence my recommendation to wrap 
>> > it in an #ifdef __clang__. 
>> > 
>> 
>> Yes, I understand that. :) 
>> 
>> >> 
>> >> At this point I do not trust the compiler and am inclined to 
>> >> do 
>> > 
>> > Nonsense. 
>> > 
>> >> like was done for GCC when it was broken: disable the 
>> >> optimization and warn users to upgrade after the compiler is 
>> >> fixed and confirmed to work. 
>> >> 
>> >> If you agree I can send a v2 with this and also drop the GCC 
>> >> pragma as Arvind and Ard suggested. 
>> > 
>> > If you resend "this" as in 2/2, I will NACK it.  There's 
>> > nothing wrong with the cost model; it's saying there's little 
>> > point in generating the vectorized version because you're 
>> > still going to need a non-vectorized loop version anyways. 
>> > Claiming there is a compiler bug here is dubious just because 
>> > the cost models between two compilers differ slightly. 
>> 
>> Ok, so that "remark" from the compiler is safe to ignore. 
> 
> Are you always seeing it when building with the pragma's added, 
> no change to CFLAGS_xor-neon.o in arch/arm/lib/Makefile? 
> 

No, I have to modify CFLAGS_xor-neon.o to see the remarks. If I do 
a build with just the pragma change I only always get the 
warnings, not remarks.

Here's a more complete log with -Rpass-missed='.*' in the 
Makefile, maybe the other remarks in there will help shed some 
light.

http://ix.io/2DMl

>> 
>> > 
>> > Resend the patch removing the warning, remove the GCC pragma, 
>> > but if you want to change anything here for Clang, use 
>> > `#pragma clang loop vectorize(enable)` wrapped in an `#ifdef 
>> > __clang__`. 
>> > 
>> 
>> Thanks for making the NACK clear, so the way forward is to 
>> either use the pragma if I can figure out the new 'loop not 
>> vectorized' warning (which might also be a red herring) or just 
>> leave Clang as is. :) 
> 
> Yes, though additionally Arvind points out that this code is 
> kind of curious if there was overlap; maybe the parameters 
> should just be restrict-qualified. 
>

For now I think I'll just re-send the GCC changes and leave the 
Clang optimization as is, until we better understand what's 
happening and what's the best way to enable it.

>>
>> >>
>> >> Kind regards,
>> >> Adrian
>> >>
>> >> [1]
>> >> ./include/asm-generic/xor.h:11:1: warning: loop not vectorized:
>> >> the optimizer was unable to perform the requested transformation;
>> >> the transformation might be disabled or specified as part of an
>> >> unsupported transformation ordering
>> >> [-Wpass-failed=transform-warning] xor_8regs_2(unsigned long bytes,
>> >> unsigned long *p1, unsigned long *p2)
>
> -- 
> Thanks,
> ~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-11 14:15                   ` Adrian Ratiu
@ 2020-11-12 21:50                     ` Arvind Sankar
  -1 siblings, 0 replies; 56+ messages in thread
From: Arvind Sankar @ 2020-11-12 21:50 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Nick Desaulniers, Nathan Chancellor, Arnd Bergmann, Linux ARM,
	clang-built-linux, Russell King, LKML, Collabora Kernel ML,
	Ard Biesheuvel

On Wed, Nov 11, 2020 at 04:15:59PM +0200, Adrian Ratiu wrote:
> On Tue, 10 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
> wrote:
> > 
> > Yes, though additionally Arvind points out that this code is 
> > kind of curious if there was overlap; maybe the parameters 
> > should just be restrict-qualified. 
> >
> 
> For now I think I'll just re-send the GCC changes and leave the 
> Clang optimization as is, until we better understand what's 
> happening and what's the best way to enable it.
> 

Note that the __restrict__ keywords also help GCC -- it saves it from
having to emit the non-vectorized version and switch between the two at
runtime. If we can verify it's safe, it's a good thing to add all
around.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-12 21:50                     ` Arvind Sankar
  0 siblings, 0 replies; 56+ messages in thread
From: Arvind Sankar @ 2020-11-12 21:50 UTC (permalink / raw)
  To: Adrian Ratiu
  Cc: Arnd Bergmann, Nick Desaulniers, Russell King, LKML,
	clang-built-linux, Nathan Chancellor, Collabora Kernel ML,
	Ard Biesheuvel, Linux ARM

On Wed, Nov 11, 2020 at 04:15:59PM +0200, Adrian Ratiu wrote:
> On Tue, 10 Nov 2020, Nick Desaulniers <ndesaulniers@google.com> 
> wrote:
> > 
> > Yes, though additionally Arvind points out that this code is 
> > kind of curious if there was overlap; maybe the parameters 
> > should just be restrict-qualified. 
> >
> 
> For now I think I'll just re-send the GCC changes and leave the 
> Clang optimization as is, until we better understand what's 
> happening and what's the best way to enable it.
> 

Note that the __restrict__ keywords also help GCC -- it saves it from
having to emit the non-vectorized version and switch between the two at
runtime. If we can verify it's safe, it's a good thing to add all
around.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
  2020-11-12 21:50                     ` Arvind Sankar
@ 2020-11-12 21:55                       ` Nick Desaulniers
  -1 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-12 21:55 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Adrian Ratiu, Nathan Chancellor, Arnd Bergmann, Linux ARM,
	clang-built-linux, Russell King, LKML, Collabora Kernel ML,
	Ard Biesheuvel

On Thu, Nov 12, 2020 at 1:50 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> On Wed, Nov 11, 2020 at 04:15:59PM +0200, Adrian Ratiu wrote:
> > On Tue, 10 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> > wrote:
> > >
> > > Yes, though additionally Arvind points out that this code is
> > > kind of curious if there was overlap; maybe the parameters
> > > should just be restrict-qualified.
> > >
> >
> > For now I think I'll just re-send the GCC changes and leave the
> > Clang optimization as is, until we better understand what's
> > happening and what's the best way to enable it.
> >
>
> Note that the __restrict__ keywords also help GCC -- it saves it from
> having to emit the non-vectorized version and switch between the two at
> runtime. If we can verify it's safe, it's a good thing to add all
> around.

100% agree.  Even a BUILD_BUG_ON or WARN_ON in callers to validate
such an invariant might be nice.
-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization
@ 2020-11-12 21:55                       ` Nick Desaulniers
  0 siblings, 0 replies; 56+ messages in thread
From: Nick Desaulniers @ 2020-11-12 21:55 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Arnd Bergmann, Adrian Ratiu, Russell King, LKML,
	clang-built-linux, Nathan Chancellor, Collabora Kernel ML,
	Ard Biesheuvel, Linux ARM

On Thu, Nov 12, 2020 at 1:50 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> On Wed, Nov 11, 2020 at 04:15:59PM +0200, Adrian Ratiu wrote:
> > On Tue, 10 Nov 2020, Nick Desaulniers <ndesaulniers@google.com>
> > wrote:
> > >
> > > Yes, though additionally Arvind points out that this code is
> > > kind of curious if there was overlap; maybe the parameters
> > > should just be restrict-qualified.
> > >
> >
> > For now I think I'll just re-send the GCC changes and leave the
> > Clang optimization as is, until we better understand what's
> > happening and what's the best way to enable it.
> >
>
> Note that the __restrict__ keywords also help GCC -- it saves it from
> having to emit the non-vectorized version and switch between the two at
> runtime. If we can verify it's safe, it's a good thing to add all
> around.

100% agree.  Even a BUILD_BUG_ON or WARN_ON in callers to validate
such an invariant might be nice.
-- 
Thanks,
~Nick Desaulniers

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2020-11-12 21:56 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-06  5:14 [PATCH 0/2] arm: lib: xor-neon: Remove warn & disble neon vect Adrian Ratiu
2020-11-06  5:14 ` Adrian Ratiu
2020-11-06  5:14 ` [PATCH 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning Adrian Ratiu
2020-11-06  5:14   ` Adrian Ratiu
2020-11-06 14:46   ` Arnd Bergmann
2020-11-06 14:46     ` Arnd Bergmann
2020-11-06 18:03     ` Nathan Chancellor
2020-11-06 18:03       ` Nathan Chancellor
2020-11-06 21:46       ` Arnd Bergmann
2020-11-06 21:46         ` Arnd Bergmann
2020-11-06  5:14 ` [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization Adrian Ratiu
2020-11-06  5:14   ` Adrian Ratiu
2020-11-06 10:14   ` Nathan Chancellor
2020-11-06 10:14     ` Nathan Chancellor
2020-11-06 11:50     ` Adrian Ratiu
2020-11-06 11:50       ` Adrian Ratiu
2020-11-06 18:01       ` Nathan Chancellor
2020-11-06 18:01         ` Nathan Chancellor
2020-11-06 19:52       ` Nick Desaulniers
2020-11-06 19:52         ` Nick Desaulniers
2020-11-07 18:07         ` Adrian Ratiu
2020-11-07 18:07           ` Adrian Ratiu
2020-11-09 19:53         ` Adrian Ratiu
2020-11-09 19:53           ` Adrian Ratiu
2020-11-10 21:41           ` Nick Desaulniers
2020-11-10 21:41             ` Nick Desaulniers
2020-11-10 22:15             ` Arvind Sankar
2020-11-10 22:15               ` Arvind Sankar
2020-11-10 22:36               ` Nick Desaulniers
2020-11-10 22:36                 ` Nick Desaulniers
2020-11-10 22:39                 ` Nick Desaulniers
2020-11-10 22:39                   ` Nick Desaulniers
2020-11-10 22:39                   ` Nick Desaulniers
2020-11-10 22:39                     ` Nick Desaulniers
2020-11-10 22:54                     ` Arvind Sankar
2020-11-10 22:54                       ` Arvind Sankar
2020-11-10 23:56             ` Adrian Ratiu
2020-11-10 23:56               ` Adrian Ratiu
2020-11-11  0:18               ` Nick Desaulniers
2020-11-11  0:18                 ` Nick Desaulniers
2020-11-11 14:15                 ` Adrian Ratiu
2020-11-11 14:15                   ` Adrian Ratiu
2020-11-12 21:50                   ` Arvind Sankar
2020-11-12 21:50                     ` Arvind Sankar
2020-11-12 21:55                     ` Nick Desaulniers
2020-11-12 21:55                       ` Nick Desaulniers
2020-11-07 10:22   ` Russell King - ARM Linux admin
2020-11-07 10:22     ` Russell King - ARM Linux admin
2020-11-07 18:12     ` Adrian Ratiu
2020-11-07 18:12       ` Adrian Ratiu
2020-11-08 17:40   ` Arvind Sankar
2020-11-08 17:40     ` Arvind Sankar
2020-11-08 18:09     ` Arvind Sankar
2020-11-08 18:09       ` Arvind Sankar
2020-11-08 20:14       ` Ard Biesheuvel
2020-11-08 20:14         ` Ard Biesheuvel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.