* [PATCH v0 1/5] x86_64: march=native support
@ 2017-12-07 22:41 Alexey Dobriyan
  2017-12-07 22:41 ` [PATCH 2/5] -march=native: POPCNT support Alexey Dobriyan
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Alexey Dobriyan @ 2017-12-07 22:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, mingo, hpa, Alexey Dobriyan

As a Gentoo user, part of me died a little every time I compiled the
kernel with plain -O2 while userspace had been built with
"-march=native -O2" for years.

This patch implements building the kernel with "-march=native", at last.
So far the resulting kernel is good enough to boot in a VM.

Benchmarks:
	No serious benchmarking has been done yet. :-(

Random microbenchmarking indicates that a) an SHLX-enabled SHA-1 can be
~10% faster than the regular one, as there are no carry flag dependencies,
and b) a REP STOSB clear_page() can be ~15% faster than the REP STOSQ one
where fast REP STOSB (ERMS) is advertised. This actually matters because
clear_page()/copy_page() regularly show up at the top of kernel profiles.
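
For reference, here is the sort of userspace microbenchmark meant for
comparison b) (a hypothetical harness, not the one actually used; assumes
an ERMS-capable CPU and gcc):

	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>

	#define PAGE_SIZE 4096
	#define ITERS 1000000

	static void clear_stosq(void *page)
	{
		uint64_t cnt = PAGE_SIZE / 8;	/* 512 qword stores */
		asm volatile ("rep stosq" : "+D" (page), "+c" (cnt)
			      : "a" (0ULL) : "memory");
	}

	static void clear_stosb(void *page)
	{
		uint64_t cnt = PAGE_SIZE;	/* 4096 byte stores */
		asm volatile ("rep stosb" : "+D" (page), "+c" (cnt)
			      : "a" (0) : "memory");
	}

	static long bench(void (*f)(void *), void *page)
	{
		struct timespec a, b;

		clock_gettime(CLOCK_MONOTONIC, &a);
		for (int i = 0; i < ITERS; i++)
			f(page);
		clock_gettime(CLOCK_MONOTONIC, &b);
		return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
	}

	int main(void)
	{
		void *page = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

		printf("rep stosq: %ld ns\n", bench(clear_stosq, page));
		printf("rep stosb: %ld ns\n", bench(clear_stosb, page));
		return 0;
	}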

Code size:
SHLX et al bloat the kernel quite a lot, as these new instructions live in
the extended opcode space. However, this is offset by telling gcc to use
REP STOSB/MOVSB: gcc otherwise loves to unroll memset/memcpy to ungodly
amounts.

These two effects roughly cancel each other out: shifts and memset/memcpy
are everywhere.

Regardless, code size is not the objective of this patch; performance is.

Support status:
	x86_64 only (haven't run a 386 in a long time)
	Intel only (never owned an AMD box)

TODO:
	foolproof protection
	SSE2/AVX/AVX2/AVX-512 disabling (-mno-...)
	.config injection
	BMI2 for %08x/%016lx (see the sketch after this list)
	faster clear_user()
	RAID functions (ungodly unrolling, requires lots of courage)
	BPF JIT
	and of course more instructions which the kernel is forced to
	ignore because of generic kernels
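
For the BMI2-for-%08x item above, a purely hypothetical sketch of what is
meant (assumes the PDEP-based nibble-spreading trick; build with -mbmi2;
none of this is in the patch):

	#include <stdint.h>
	#include <x86intrin.h>

	/* PDEP spreads each nibble of x into its own byte of a 64-bit word,
	 * replacing the usual shift/mask loop of %08x with one instruction
	 * plus a table lookup per digit. */
	static void hex8(char buf[static 9], uint32_t x)
	{
		static const char digits[] = "0123456789abcdef";
		uint64_t nibbles = _pdep_u64(x, 0x0f0f0f0f0f0f0f0fULL);
		int i;

		for (i = 0; i < 8; i++)
			buf[7 - i] = digits[(nibbles >> (8 * i)) & 0xff];
		buf[8] = '\0';
	}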

If you want to try it out:
* make sure this kernel is only used on the machine it is compiled on
* grab a gcc with "-march=native" support (modern ones have it)
* select CONFIG_MARCH_NATIVE in the CPU choice menu
* add "unexpected options" to scripts/march-native.sh until the checks pass
* verify the CONFIG_MARCH_NATIVE options in .config, include/config/auto.conf
  and include/generated/autoconf.h
* cross fingers, recompile and reboot

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 Makefile                   |  4 ++
 arch/x86/Kconfig.cpu       |  8 ++++
 scripts/kconfig/.gitignore |  1 +
 scripts/kconfig/Makefile   |  9 ++++-
 scripts/kconfig/cpuid.c    | 76 ++++++++++++++++++++++++++++++++++++
 scripts/march-native.sh    | 96 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 193 insertions(+), 1 deletion(-)
 create mode 100644 scripts/kconfig/cpuid.c
 create mode 100755 scripts/march-native.sh

diff --git a/Makefile b/Makefile
index 86bb80540cbd..c1cc730b81a8 100644
--- a/Makefile
+++ b/Makefile
@@ -587,6 +587,10 @@ ifeq ($(dot-config),1)
 # Read in config
 -include include/config/auto.conf
 
+ifdef CONFIG_MARCH_NATIVE
+KBUILD_CFLAGS += -march=native
+endif
+
 ifeq ($(KBUILD_EXTMOD),)
 # Read in dependencies to all Kconfig* files, make sure to run
 # oldconfig if changes are detected.
diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 4493e8c5d1ea..2e4750b6b891 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -287,6 +287,12 @@ config GENERIC_CPU
 	  Generic x86-64 CPU.
 	  Run equally well on all x86-64 CPUs.
 
+config MARCH_NATIVE
+	bool "-march=native"
+	depends on X86_64
+	---help---
+	  -march=native support.
+
 endchoice
 
 config X86_GENERIC
@@ -307,6 +313,7 @@ config X86_INTERNODE_CACHE_SHIFT
 	int
 	default "12" if X86_VSMP
 	default X86_L1_CACHE_SHIFT
+	depends on !MARCH_NATIVE
 
 config X86_L1_CACHE_SHIFT
 	int
@@ -314,6 +321,7 @@ config X86_L1_CACHE_SHIFT
 	default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU
 	default "4" if MELAN || M486 || MGEODEGX1
 	default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
+	depends on !MARCH_NATIVE
 
 config X86_PPRO_FENCE
 	bool "PentiumPro memory ordering errata workaround"
diff --git a/scripts/kconfig/.gitignore b/scripts/kconfig/.gitignore
index 51f1c877b543..73ebca4b1888 100644
--- a/scripts/kconfig/.gitignore
+++ b/scripts/kconfig/.gitignore
@@ -14,6 +14,7 @@ gconf.glade.h
 # configuration programs
 #
 conf
+cpuid
 mconf
 nconf
 qconf
diff --git a/scripts/kconfig/Makefile b/scripts/kconfig/Makefile
index 297c1bf35140..7b43b66d4efa 100644
--- a/scripts/kconfig/Makefile
+++ b/scripts/kconfig/Makefile
@@ -21,24 +21,30 @@ unexport CONFIG_
 
 xconfig: $(obj)/qconf
 	$< $(silent) $(Kconfig)
+	$(Q)$(srctree)/scripts/march-native.sh $(CC) $(obj)/cpuid
 
 gconfig: $(obj)/gconf
 	$< $(silent) $(Kconfig)
+	$(Q)$(srctree)/scripts/march-native.sh $(CC) $(obj)/cpuid
 
 menuconfig: $(obj)/mconf
 	$< $(silent) $(Kconfig)
+	$(Q)$(srctree)/scripts/march-native.sh $(CC) $(obj)/cpuid
 
 config: $(obj)/conf
 	$< $(silent) --oldaskconfig $(Kconfig)
+	$(Q)$(srctree)/scripts/march-native.sh $(CC) $(obj)/cpuid
 
 nconfig: $(obj)/nconf
 	$< $(silent) $(Kconfig)
+	$(Q)$(srctree)/scripts/march-native.sh $(CC) $(obj)/cpuid
 
-silentoldconfig: $(obj)/conf
+silentoldconfig: $(obj)/conf $(obj)/cpuid
 	$(Q)mkdir -p include/config include/generated
 	$(Q)test -e include/generated/autoksyms.h || \
 	    touch   include/generated/autoksyms.h
 	$< $(silent) --$@ $(Kconfig)
+	$(Q)$(srctree)/scripts/march-native.sh $(CC) $(obj)/cpuid
 
 localyesconfig localmodconfig: $(obj)/streamline_config.pl $(obj)/conf
 	$(Q)mkdir -p include/config include/generated
@@ -190,6 +196,7 @@ qconf-objs	:= zconf.tab.o
 gconf-objs	:= gconf.o zconf.tab.o
 
 hostprogs-y := conf nconf mconf kxgettext qconf gconf
+hostprogs-y += cpuid
 
 clean-files	:= qconf.moc .tmp_qtcheck .tmp_gtkcheck
 clean-files	+= zconf.tab.c zconf.lex.c gconf.glade.h
diff --git a/scripts/kconfig/cpuid.c b/scripts/kconfig/cpuid.c
new file mode 100644
index 000000000000..f1983027fe2b
--- /dev/null
+++ b/scripts/kconfig/cpuid.c
@@ -0,0 +1,76 @@
+/*
+ * Copyright (c) 2017 Alexey Dobriyan <adobriyan@gmail.com>
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+static inline bool streq(const char *s1, const char *s2)
+{
+	return strcmp(s1, s2) == 0;
+}
+
+static inline void cpuid(uint32_t eax0, uint32_t *eax, uint32_t *ecx, uint32_t *edx, uint32_t *ebx)
+{
+	asm volatile (
+		"cpuid"
+		: "=a" (*eax), "=c" (*ecx), "=d" (*edx), "=b" (*ebx)
+		: "0" (eax0)
+	);
+}
+
+static inline void cpuid2(uint32_t eax0, uint32_t ecx0, uint32_t *eax, uint32_t *ecx, uint32_t *edx, uint32_t *ebx)
+{
+	asm volatile (
+		"cpuid"
+		: "=a" (*eax), "=c" (*ecx), "=d" (*edx), "=b" (*ebx)
+		: "0" (eax0), "1" (ecx0)
+	);
+}
+
+static uint32_t eax0_max;
+
+static void intel(void)
+{
+	uint32_t eax, ecx, edx, ebx;
+
+	if (eax0_max >= 1) {
+		cpuid(1, &eax, &ecx, &edx, &ebx);
+//		printf("%08x %08x %08x %08x\n", eax, ecx, edx, ebx);
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	const char *opt = argv[1];
+	uint32_t eax, ecx, edx, ebx;
+
+	if (argc != 2)
+		return EXIT_FAILURE;
+
+	cpuid(0, &eax, &ecx, &edx, &ebx);
+//	printf("%08x %08x %08x %08x\n", eax, ecx, edx, ebx);
+	eax0_max = eax;
+
+	if (ecx == 0x6c65746e && edx == 0x49656e69 && ebx == 0x756e6547)
+		intel();
+
+#define _(x)	if (streq(opt, #x)) return x ? EXIT_SUCCESS : EXIT_FAILURE
+#undef _
+
+	return EXIT_FAILURE;
+}
diff --git a/scripts/march-native.sh b/scripts/march-native.sh
new file mode 100755
index 000000000000..4f0fc82f7722
--- /dev/null
+++ b/scripts/march-native.sh
@@ -0,0 +1,96 @@
+#!/bin/sh
+# Copyright (c) 2017 Alexey Dobriyan <adobriyan@gmail.com>
+if test "$(uname -m)" != x86_64; then
+	exit 0
+fi
+
+CC="$1"
+CPUID="$2"
+AUTOCONF1="include/config/auto.conf"
+AUTOCONF2="include/generated/autoconf.h"
+
+if ! grep -q -e '^CONFIG_MARCH_NATIVE=y$' .config; then
+	sed -i -e '/CONFIG_MARCH_NATIVE/d' "$AUTOCONF1" "$AUTOCONF2" >/dev/null 2>&1
+	exit 0
+fi
+
+if ! "$CC" -march=native -x c -c -o /dev/null /dev/null >/dev/null 2>&1; then
+	echo "error: unsupported '-march=native' compiler option" >&2
+	exit 1
+fi
+
+_option() {
+	echo "$1=$2"		>>"$AUTOCONF1"
+	echo "#define $1 $2"	>>"$AUTOCONF2"
+}
+
+option() {
+	echo "$1=y"		>>"$AUTOCONF1"
+	echo "#define $1 1"	>>"$AUTOCONF2"
+}
+
+if test ! -f "$AUTOCONF1" -o ! -f "$AUTOCONF2"; then
+	exit 0
+fi
+
+COLLECT_GCC_OPTIONS=$("$CC" -march=native -v -E -x c -c /dev/null 2>&1 | sed -ne '/^COLLECT_GCC_OPTIONS=/{n;p}')
+echo $COLLECT_GCC_OPTIONS
+
+for i in $COLLECT_GCC_OPTIONS; do
+	case $i in
+		*/cc1|-E|-quiet|-v|/dev/null|--param|-fstack-protector*|-mno-*)
+			;;
+
+		# FIXME
+		l1-cache-size=*);;
+		l2-cache-size=*);;
+
+		l1-cache-line-size=64)
+			_option "CONFIG_X86_L1_CACHE_SHIFT"		6
+			_option "CONFIG_X86_INTERNODE_CACHE_SHIFT"	6
+			;;
+
+		-march=broadwell);;
+		-mtune=broadwell);;
+		-march=nehalem);;
+		-mtune=nehalem);;
+
+		-mabm)		option "CONFIG_MARCH_NATIVE_ABM"	;;
+		-madx)		option "CONFIG_MARCH_NATIVE_ADX"	;;
+		-maes)		option "CONFIG_MARCH_NATIVE_AES"	;;
+		-mavx)		option "CONFIG_MARCH_NATIVE_AVX"	;;
+		-mavx2)		option "CONFIG_MARCH_NATIVE_AVX2"	;;
+		-mbmi)		option "CONFIG_MARCH_NATIVE_BMI"	;;
+		-mbmi2)		option "CONFIG_MARCH_NATIVE_BMI2"	;;
+		-mcx16)		option "CONFIG_MARCH_NATIVE_CX16"	;;
+		-mf16c)		option "CONFIG_MARCH_NATIVE_F16C"	;;
+		-mfsgsbase)	option "CONFIG_MARCH_NATIVE_FSGSBASE"	;;
+		-mfma)		option "CONFIG_MARCH_NATIVE_FMA"	;;
+		-mfxsr)		option "CONFIG_MARCH_NATIVE_FXSR"	;;
+		-mhle)		option "CONFIG_MARCH_NATIVE_HLE"	;;
+		-mlzcnt)	option "CONFIG_MARCH_NATIVE_LZCNT"	;;
+		-mmmx)		option "CONFIG_MARCH_NATIVE_MMX"	;;
+		-mmovbe)	option "CONFIG_MARCH_NATIVE_MOVBE"	;;
+		-mpclmul)	option "CONFIG_MARCH_NATIVE_PCLMUL"	;;
+		-mpopcnt)	option "CONFIG_MATCH_NATIVE_POPCNT"	;;
+		-mprfchw)	option "CONFIG_MARCH_NATIVE_PREFETCHW"	;;
+		-mrdrnd)	option "CONFIG_MARCH_NATIVE_RDRND"	;;
+		-mrdseed)	option "CONFIG_MARCH_NATIVE_RDSEED"	;;
+		-mrtm)		option "CONFIG_MARCH_NATIVE_RTM"	;;
+		-msahf)		option "CONFIG_MARCH_NATIVE_SAHF"	;;
+		-msse)		option "CONFIG_MARCH_NATIVE_SSE"	;;
+		-msse2)		option "CONFIG_MARCH_NATIVE_SSE2"	;;
+		-msse3)		option "CONFIG_MARCH_NATIVE_SSE3"	;;
+		-msse4.1)	option "CONFIG_MARCH_NATIVE_SSE4_1"	;;
+		-msse4.2)	option "CONFIG_MARCH_NATIVE_SSE4_2"	;;
+		-mssse3)	option "CONFIG_MARCH_NATIVE_SSSE3"	;;
+		-mxsave)	option "CONFIG_MARCH_NATIVE_XSAVE"	;;
+		-mxsaveopt)	option "CONFIG_MARCH_NATIVE_XSAVEOPT"	;;
+
+		*)
+			echo >&2
+			echo "Unexpected -march=native option: '$i'" >&2
+			echo >&2
+			exit 1
+	esac
+done
-- 
2.13.6

* [PATCH 2/5] -march=native: POPCNT support
  2017-12-07 22:41 [PATCH v0 1/5] x86_64: march=native support Alexey Dobriyan
@ 2017-12-07 22:41 ` Alexey Dobriyan
  2017-12-07 23:07   ` H. Peter Anvin
  2017-12-07 22:41 ` [PATCH 3/5] -march=native: REP MOVSB support Alexey Dobriyan
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Alexey Dobriyan @ 2017-12-07 22:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, mingo, hpa, Alexey Dobriyan

The mainline kernel can only generate the "popcnt rax, rdi" instruction,
via an alternative masquerading as a function call. This patch allows
generating all POPCNT variations and inlines the hweight*() family of
functions.

	$ objdump  -dr ../obj/vmlinux | grep popcnt
	ffffffff81004f6d:       f3 48 0f b8 c9          popcnt rcx,rcx
	ffffffff81008484:       f3 48 0f b8 03          popcnt rax,QWORD PTR [rbx]
	ffffffff81073aae:       f3 48 0f b8 d8          popcnt rbx,rax
		...
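
As a standalone illustration (a hypothetical userspace test; the "rm"
constraint is what lets gcc pick the register and memory forms seen in
the objdump output above):

	#include <stdint.h>
	#include <stdio.h>

	static inline unsigned int hweight64(uint64_t x)
	{
		uint64_t rv;

		asm ("popcnt %1, %0" : "=r" (rv) : "rm" (x));
		return rv;
	}

	int main(void)
	{
		printf("%u\n", hweight64(0xdeadbeefULL));	/* prints 24 */
		return 0;
	}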

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 arch/x86/include/asm/arch_hweight.h | 32 ++++++++++++++++++++++++++++++--
 arch/x86/lib/Makefile               |  5 ++++-
 include/linux/bitops.h              |  2 ++
 lib/Makefile                        |  2 ++
 scripts/kconfig/cpuid.c             |  6 ++++++
 scripts/march-native.sh             |  6 +++++-
 6 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
index 34a10b2d5b73..58e4f65d8665 100644
--- a/arch/x86/include/asm/arch_hweight.h
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -2,6 +2,34 @@
 #ifndef _ASM_X86_HWEIGHT_H
 #define _ASM_X86_HWEIGHT_H
 
+#define __HAVE_ARCH_SW_HWEIGHT
+
+#ifdef CONFIG_MARCH_NATIVE_POPCNT
+static inline unsigned int __arch_hweight64(uint64_t x)
+{
+	uint64_t rv;
+	asm ("popcnt %1, %0" : "=r" (rv) : "rm" (x));
+	return rv;
+}
+
+static inline unsigned int __arch_hweight32(uint32_t x)
+{
+	uint32_t rv;
+	asm ("popcnt %1, %0" : "=r" (rv) : "rm" (x));
+	return rv;
+}
+
+static inline unsigned int __arch_hweight16(uint16_t x)
+{
+	return __arch_hweight32(x);
+}
+
+static inline unsigned int __arch_hweight8(uint8_t x)
+{
+	return __arch_hweight32(x);
+}
+#else
+
 #include <asm/cpufeatures.h>
 
 #ifdef CONFIG_64BIT
@@ -18,8 +46,6 @@
 #define REG_OUT "a"
 #endif
 
-#define __HAVE_ARCH_SW_HWEIGHT
-
 static __always_inline unsigned int __arch_hweight32(unsigned int w)
 {
 	unsigned int res;
@@ -61,3 +87,5 @@ static __always_inline unsigned long __arch_hweight64(__u64 w)
 #endif /* CONFIG_X86_32 */
 
 #endif
+
+#endif
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 7b181b61170e..c26ad76e7048 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -27,7 +27,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o
 lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
 
-obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
+obj-y += msr.o msr-reg.o msr-reg-export.o
+ifneq ($(CONFIG_MARCH_NATIVE),y)
+	obj-y += hweight.o
+endif
 
 ifeq ($(CONFIG_X86_32),y)
         obj-y += atomic64_32.o
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 4cac4e1a72ff..ab58fed4ab90 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -26,10 +26,12 @@
 	(((~0ULL) - (1ULL << (l)) + 1) & \
 	 (~0ULL >> (BITS_PER_LONG_LONG - 1 - (h))))
 
+#ifndef CONFIG_MARCH_NATIVE_POPCNT
 extern unsigned int __sw_hweight8(unsigned int w);
 extern unsigned int __sw_hweight16(unsigned int w);
 extern unsigned int __sw_hweight32(unsigned int w);
 extern unsigned long __sw_hweight64(__u64 w);
+#endif
 
 /*
  * Include this here because some architectures need generic_ffs/fls in
diff --git a/lib/Makefile b/lib/Makefile
index d11c48ec8ffd..3867b73721aa 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -81,7 +81,9 @@ obj-$(CONFIG_HAS_IOMEM) += iomap_copy.o devres.o
 obj-$(CONFIG_CHECK_SIGNATURE) += check_signature.o
 obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
 
+ifneq ($(CONFIG_MARCH_NATIVE_POPCNT),y)
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+endif
 
 obj-$(CONFIG_BTREE) += btree.o
 obj-$(CONFIG_INTERVAL_TREE) += interval_tree.o
diff --git a/scripts/kconfig/cpuid.c b/scripts/kconfig/cpuid.c
index f1983027fe2b..e565dd446bdf 100644
--- a/scripts/kconfig/cpuid.c
+++ b/scripts/kconfig/cpuid.c
@@ -42,6 +42,8 @@ static inline void cpuid2(uint32_t eax0, uint32_t ecx0, uint32_t *eax, uint32_t
 	);
 }
 
+static bool popcnt	= false;
+
 static uint32_t eax0_max;
 
 static void intel(void)
@@ -51,6 +53,9 @@ static void intel(void)
 	if (eax0_max >= 1) {
 		cpuid(1, &eax, &ecx, &edx, &ebx);
 //		printf("%08x %08x %08x %08x\n", eax, ecx, edx, ebx);
+
+		if (ecx & (1 << 23))
+			popcnt = true;
 	}
 }
 
@@ -70,6 +75,7 @@ int main(int argc, char *argv[])
 		intel();
 
 #define _(x)	if (streq(opt, #x)) return x ? EXIT_SUCCESS : EXIT_FAILURE
+	_(popcnt);
 #undef _
 
 	return EXIT_FAILURE;
diff --git a/scripts/march-native.sh b/scripts/march-native.sh
index 4f0fc82f7722..6641e356b646 100755
--- a/scripts/march-native.sh
+++ b/scripts/march-native.sh
@@ -29,6 +29,10 @@ option() {
 	echo "#define $1 1"	>>"$AUTOCONF2"
 }
 
+if test -x "$CPUID"; then
+	"$CPUID" popcnt		&& option "CONFIG_MARCH_NATIVE_POPCNT"
+fi
+
 if test ! -f "$AUTOCONF1" -o ! -f "$AUTOCONF2"; then
 	exit 0
 fi
@@ -72,7 +76,7 @@ for i in $COLLECT_GCC_OPTIONS; do
 		-mmmx)		option "CONFIG_MARCH_NATIVE_MMX"	;;
 		-mmovbe)	option "CONFIG_MARCH_NATIVE_MOVBE"	;;
 		-mpclmul)	option "CONFIG_MARCH_NATIVE_PCLMUL"	;;
-		-mpopcnt)	option "CONFIG_MATCH_NATIVE_POPCNT"	;;
+		-mpopcnt);;
 		-mprfchw)	option "CONFIG_MARCH_NATIVE_PREFETCHW"	;;
 		-mrdrnd)	option "CONFIG_MARCH_NATIVE_RDRND"	;;
 		-mrdseed)	option "CONFIG_MARCH_NATIVE_RDSEED"	;;
-- 
2.13.6

* [PATCH 3/5] -march=native: REP MOVSB support
  2017-12-07 22:41 [PATCH v0 1/5] x86_64: march=native support Alexey Dobriyan
  2017-12-07 22:41 ` [PATCH 2/5] -march=native: POPCNT support Alexey Dobriyan
@ 2017-12-07 22:41 ` Alexey Dobriyan
  2017-12-07 22:41 ` [PATCH 4/5] -march=native: REP STOSB Alexey Dobriyan
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Alexey Dobriyan @ 2017-12-07 22:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, mingo, hpa, Alexey Dobriyan

If the CPU advertises fast REP MOVSB (ERMS), use it.

Inline copy_page() so that only 3 registers are used across the call
site, not the whole caller-saved set required by the ABI.

Also, tell gcc to use REP MOVSB for memcpy(); this saves terabytes of
.text.
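
"Advertises fast REP MOVSB" means the ERMS bit; a standalone sketch of
the check (the cpuid.c change below does the same thing, after first
validating the maximum basic leaf):

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	/* CPUID.(EAX=7,ECX=0):EBX bit 9 is ERMS (Enhanced REP MOVSB/STOSB).
	 * A real caller must first verify that leaf 7 exists at all. */
	static bool cpu_has_erms(void)
	{
		uint32_t eax = 7, ebx, ecx = 0, edx;

		asm volatile ("cpuid"
			      : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx));
		return ebx & (1 << 9);
	}

	int main(void)
	{
		printf("ERMS: %s\n", cpu_has_erms() ? "yes" : "no");
		return 0;
	}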

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 Makefile                             |  3 +++
 arch/x86/include/asm/page_64.h       | 13 +++++++++++++
 arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++++++
 arch/x86/lib/Makefile                |  5 ++++-
 arch/x86/lib/memcpy_64.S             | 13 +++++++++++++
 arch/x86/xen/xen-pvh.S               |  4 ++++
 scripts/kconfig/cpuid.c              |  9 +++++++++
 scripts/march-native.sh              |  1 +
 8 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index c1cc730b81a8..84abac4c181a 100644
--- a/Makefile
+++ b/Makefile
@@ -590,6 +590,9 @@ ifeq ($(dot-config),1)
 ifdef CONFIG_MARCH_NATIVE
 KBUILD_CFLAGS += -march=native
 endif
+ifdef CONFIG_MARCH_NATIVE_REP_MOVSB
+KBUILD_CFLAGS += -mmemcpy-strategy=rep_byte:-1:align
+endif
 
 ifeq ($(KBUILD_EXTMOD),)
 # Read in dependencies to all Kconfig* files, make sure to run
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 4baa6bceb232..c2353661eaf1 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -50,7 +50,20 @@ static inline void clear_page(void *page)
 			   : "memory", "rax", "rcx");
 }
 
+#ifdef CONFIG_MARCH_NATIVE_REP_MOVSB
+static __always_inline void copy_page(void *to, void *from)
+{
+	uint32_t len = PAGE_SIZE;
+	asm volatile (
+		"rep movsb"
+		: "+D" (to), "+S" (from), "+c" (len)
+		:
+		: "memory"
+	);
+}
+#else
 void copy_page(void *to, void *from);
+#endif
 
 #ifdef CONFIG_X86_MCE
 #define arch_unmap_kpfn arch_unmap_kpfn
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 307d3bac5f04..6ccfb9a63d5c 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -260,18 +260,33 @@ swap_pages:
 	movq	%rsi, %rax
 
 	movq	%r10, %rdi
+#ifdef CONFIG_MARCH_NATIVE_REP_MOVSB
+	mov	$4096, %ecx
+	rep movsb
+#else
 	movl	$512, %ecx
 	rep ; movsq
+#endif
 
 	movq	%rax, %rdi
 	movq	%rdx, %rsi
+#ifdef CONFIG_MARCH_NATIVE_REP_MOVSB
+	mov	$4096, %ecx
+	rep movsb
+#else
 	movl	$512, %ecx
 	rep ; movsq
+#endif
 
 	movq	%rdx, %rdi
 	movq	%r10, %rsi
+#ifdef CONFIG_MARCH_NATIVE_REP_MOVSB
+	mov	$4096, %ecx
+	rep movsb
+#else
 	movl	$512, %ecx
 	rep ; movsq
+#endif
 
 	lea	PAGE_SIZE(%rax), %rsi
 	jmp	0b
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index c26ad76e7048..8f9460bef2ec 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -45,7 +45,10 @@ endif
 else
         obj-y += iomap_copy_64.o
         lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
-        lib-y += clear_page_64.o copy_page_64.o
+        lib-y += clear_page_64.o
+ifneq ($(CONFIG_MARCH_NATIVE_REP_MOVSB),y)
+	lib-y += copy_page_64.o
+endif
         lib-y += memmove_64.o memset_64.o
         lib-y += copy_user_64.o
 	lib-y += cmpxchg16b_emu.o
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 9a53a06e5a3e..28a807eb9bd6 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -15,6 +15,18 @@
 
 .weak memcpy
 
+#ifdef CONFIG_MARCH_NATIVE_REP_MOVSB
+ENTRY(__memcpy)
+ENTRY(memcpy)
+	mov	%rdi, %rax
+	mov	%rdx, %rcx
+	rep movsb
+	ret
+ENDPROC(memcpy)
+ENDPROC(__memcpy)
+EXPORT_SYMBOL(memcpy)
+EXPORT_SYMBOL(__memcpy)
+#else
 /*
  * memcpy - Copy a memory block.
  *
@@ -181,6 +193,7 @@ ENTRY(memcpy_orig)
 .Lend:
 	retq
 ENDPROC(memcpy_orig)
+#endif
 
 #ifndef CONFIG_UML
 /*
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index e1a5fbeae08d..1dc30f0dfdf7 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -68,9 +68,13 @@ ENTRY(pvh_start_xen)
 	mov $_pa(pvh_start_info), %edi
 	mov %ebx, %esi
 	mov _pa(pvh_start_info_sz), %ecx
+#ifdef CONFIG_MARCH_NATIVE_REP_MOVSB
+	rep movsb
+#else
 	shr $2,%ecx
 	rep
 	movsl
+#endif
 
 	mov $_pa(early_stack_end), %esp
 
diff --git a/scripts/kconfig/cpuid.c b/scripts/kconfig/cpuid.c
index e565dd446bdf..4947f47e7728 100644
--- a/scripts/kconfig/cpuid.c
+++ b/scripts/kconfig/cpuid.c
@@ -43,6 +43,7 @@ static inline void cpuid2(uint32_t eax0, uint32_t ecx0, uint32_t *eax, uint32_t
 }
 
 static bool popcnt	= false;
+static bool rep_movsb	= false;
 
 static uint32_t eax0_max;
 
@@ -57,6 +58,13 @@ static void intel(void)
 		if (ecx & (1 << 23))
 			popcnt = true;
 	}
+	if (eax0_max >= 7) {
+		cpuid2(7, 0, &eax, &ecx, &edx, &ebx);
+//		printf("%08x %08x %08x %08x\n", eax, ecx, edx, ebx);
+
+		if (ebx & (1 << 9))
+			rep_movsb = true;
+	}
 }
 
 int main(int argc, char *argv[])
@@ -76,6 +84,7 @@ int main(int argc, char *argv[])
 
 #define _(x)	if (streq(opt, #x)) return x ? EXIT_SUCCESS : EXIT_FAILURE
 	_(popcnt);
+	_(rep_movsb);
 #undef _
 
 	return EXIT_FAILURE;
diff --git a/scripts/march-native.sh b/scripts/march-native.sh
index 6641e356b646..eb52c20c56b4 100755
--- a/scripts/march-native.sh
+++ b/scripts/march-native.sh
@@ -31,6 +31,7 @@ option() {
 
 if test -x "$CPUID"; then
 	"$CPUID" popcnt		&& option "CONFIG_MARCH_NATIVE_POPCNT"
+	"$CPUID" rep_movsb	&& option "CONFIG_MARCH_NATIVE_REP_MOVSB"
 fi
 
 if test ! -f "$AUTOCONF1" -o ! -f "$AUTOCONF2"; then
-- 
2.13.6

* [PATCH 4/5] -march=native: REP STOSB
  2017-12-07 22:41 [PATCH v0 1/5] x86_64: march=native support Alexey Dobriyan
  2017-12-07 22:41 ` [PATCH 2/5] -march=native: POPCNT support Alexey Dobriyan
  2017-12-07 22:41 ` [PATCH 3/5] -march=native: REP MOVSB support Alexey Dobriyan
@ 2017-12-07 22:41 ` Alexey Dobriyan
  2017-12-08 19:08   ` Andi Kleen
  2017-12-07 22:41 ` [PATCH 5/5] -march=native: MOVBE support Alexey Dobriyan
  2017-12-07 23:32 ` [PATCH v0 1/5] x86_64: march=native support H. Peter Anvin
  4 siblings, 1 reply; 10+ messages in thread
From: Alexey Dobriyan @ 2017-12-07 22:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, mingo, hpa, Alexey Dobriyan

If the CPU advertises fast REP STOSB (ERMS), use it.

Inline clear_page() so that only 3 registers are used across the call
site, not the whole caller-saved set required by the ABI.

Also, tell gcc to use REP STOSB for memset(); this saves terabytes of
.text.
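
In C terms the new memset amounts to the following (userspace sketch;
memset() must return the original pointer, hence the local copy that
REP STOSB is allowed to advance):

	#include <stddef.h>

	static void *memset_stosb(void *dest, int c, size_t n)
	{
		void *d = dest;	/* rep stosb advances %rdi, keep the original */

		asm volatile ("rep stosb"
			      : "+D" (d), "+c" (n)
			      : "a" (c)
			      : "memory");
		return dest;
	}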

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 Makefile                           |  3 +++
 arch/x86/boot/compressed/head_64.S |  4 ++++
 arch/x86/crypto/sha1_ssse3_asm.S   |  7 ++++++-
 arch/x86/include/asm/page_64.h     | 13 +++++++++++++
 arch/x86/lib/Makefile              |  2 ++
 arch/x86/lib/memset_64.S           | 15 +++++++++++++++
 scripts/kconfig/cpuid.c            |  6 +++++-
 scripts/march-native.sh            |  1 +
 8 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 84abac4c181a..70d91d52ee60 100644
--- a/Makefile
+++ b/Makefile
@@ -593,6 +593,9 @@ endif
 ifdef CONFIG_MARCH_NATIVE_REP_MOVSB
 KBUILD_CFLAGS += -mmemcpy-strategy=rep_byte:-1:align
 endif
+ifdef CONFIG_MARCH_NATIVE_REP_STOSB
+KBUILD_CFLAGS += -mmemset-strategy=rep_byte:-1:align
+endif
 
 ifeq ($(KBUILD_EXTMOD),)
 # Read in dependencies to all Kconfig* files, make sure to run
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 20919b4f3133..a7913f5e18b6 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -447,8 +447,12 @@ relocated:
 	leaq    _bss(%rip), %rdi
 	leaq    _ebss(%rip), %rcx
 	subq	%rdi, %rcx
+#ifdef CONFIG_MARCH_NATIVE_REP_STOSB
+	rep stosb
+#else
 	shrq	$3, %rcx
 	rep	stosq
+#endif
 
 /*
  * Adjust our own GOT
diff --git a/arch/x86/crypto/sha1_ssse3_asm.S b/arch/x86/crypto/sha1_ssse3_asm.S
index 6204bd53528c..ffa41d7a582a 100644
--- a/arch/x86/crypto/sha1_ssse3_asm.S
+++ b/arch/x86/crypto/sha1_ssse3_asm.S
@@ -94,10 +94,15 @@
 	SHA1_PIPELINED_MAIN_BODY
 
 	# cleanup workspace
-	mov	$8, %ecx
 	mov	%rsp, %rdi
 	xor	%rax, %rax
+#ifdef CONFIG_MARCH_NATIVE_REP_STOSB
+	mov	$64, %ecx
+	rep stosb
+#else
+	mov	$8, %ecx
 	rep stosq
+#endif
 
 	mov	%rbp, %rsp		# deallocate workspace
 	pop	%rbp
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index c2353661eaf1..b3d275b07624 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -36,6 +36,18 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 #define pfn_valid(pfn)          ((pfn) < max_pfn)
 #endif
 
+#ifdef CONFIG_MARCH_NATIVE_REP_STOSB
+static __always_inline void clear_page(void *page)
+{
+	uint32_t len = PAGE_SIZE;
+	asm volatile (
+		"rep stosb"
+		: "+D" (page), "+c" (len)
+		: "a" (0)
+		: "memory"
+	);
+}
+#else
 void clear_page_orig(void *page);
 void clear_page_rep(void *page);
 void clear_page_erms(void *page);
@@ -49,6 +61,7 @@ static inline void clear_page(void *page)
 			   "0" (page)
 			   : "memory", "rax", "rcx");
 }
+#endif
 
 #ifdef CONFIG_MARCH_NATIVE_REP_MOVSB
 static __always_inline void copy_page(void *to, void *from)
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 8f9460bef2ec..6cb356408ebb 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -45,7 +45,9 @@ endif
 else
         obj-y += iomap_copy_64.o
         lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
+ifneq ($(CONFIG_MARCH_NATIVE_REP_STOSB),y)
         lib-y += clear_page_64.o
+endif
 ifneq ($(CONFIG_MARCH_NATIVE_REP_MOVSB),y)
 	lib-y += copy_page_64.o
 endif
diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 9bc861c71e75..7786d1a65423 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -8,6 +8,20 @@
 
 .weak memset
 
+#ifdef CONFIG_MARCH_NATIVE_REP_STOSB
+ENTRY(memset)
+ENTRY(__memset)
+	mov	%esi, %eax
+	mov	%rdi, %rsi
+	mov	%rdx, %rcx
+	rep stosb
+	mov	%rsi, %rax
+	ret
+ENDPROC(memset)
+ENDPROC(__memset)
+EXPORT_SYMBOL(memset)
+EXPORT_SYMBOL(__memset)
+#else
 /*
  * ISO C memset - set a memory block to a byte value. This function uses fast
  * string to get better performance than the original function. The code is
@@ -140,3 +154,4 @@ ENTRY(memset_orig)
 	jmp .Lafter_bad_alignment
 .Lfinal:
 ENDPROC(memset_orig)
+#endif
diff --git a/scripts/kconfig/cpuid.c b/scripts/kconfig/cpuid.c
index 4947f47e7728..ecb285183581 100644
--- a/scripts/kconfig/cpuid.c
+++ b/scripts/kconfig/cpuid.c
@@ -44,6 +44,7 @@ static inline void cpuid2(uint32_t eax0, uint32_t ecx0, uint32_t *eax, uint32_t
 
 static bool popcnt	= false;
 static bool rep_movsb	= false;
+static bool rep_stosb	= false;
 
 static uint32_t eax0_max;
 
@@ -62,8 +63,10 @@ static void intel(void)
 		cpuid2(7, 0, &eax, &ecx, &edx, &ebx);
 //		printf("%08x %08x %08x %08x\n", eax, ecx, edx, ebx);
 
-		if (ebx & (1 << 9))
+		if (ebx & (1 << 9)) {
 			rep_movsb = true;
+			rep_stosb = true;
+		}
 	}
 }
 
@@ -85,6 +88,7 @@ int main(int argc, char *argv[])
 #define _(x)	if (streq(opt, #x)) return x ? EXIT_SUCCESS : EXIT_FAILURE
 	_(popcnt);
 	_(rep_movsb);
+	_(rep_stosb);
 #undef _
 
 	return EXIT_FAILURE;
diff --git a/scripts/march-native.sh b/scripts/march-native.sh
index eb52c20c56b4..d3adf0edb2be 100755
--- a/scripts/march-native.sh
+++ b/scripts/march-native.sh
@@ -32,6 +32,7 @@ option() {
 if test -x "$CPUID"; then
 	"$CPUID" popcnt		&& option "CONFIG_MARCH_NATIVE_POPCNT"
 	"$CPUID" rep_movsb	&& option "CONFIG_MARCH_NATIVE_REP_MOVSB"
+	"$CPUID" rep_stosb	&& option "CONFIG_MARCH_NATIVE_REP_STOSB"
 fi
 
 if test ! -f "$AUTOCONF1" -o ! -f "$AUTOCONF2"; then
-- 
2.13.6

* [PATCH 5/5] -march=native: MOVBE support
  2017-12-07 22:41 [PATCH v0 1/5] x86_64: march=native support Alexey Dobriyan
                   ` (2 preceding siblings ...)
  2017-12-07 22:41 ` [PATCH 4/5] -march=native: REP STOSB Alexey Dobriyan
@ 2017-12-07 22:41 ` Alexey Dobriyan
  2017-12-07 23:32 ` [PATCH v0 1/5] x86_64: march=native support H. Peter Anvin
  4 siblings, 0 replies; 10+ messages in thread
From: Alexey Dobriyan @ 2017-12-07 22:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, mingo, hpa, Alexey Dobriyan

Use MOVBE if it is available.

This doesn't save code size, as MOVBE seems to be as long as MOV+BSWAP.
It is not clear if it saves a uop; maybe it will in the future.

Do it because it is easy, I guess.
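
A userspace sketch of the transformation (MOVBE is a load or store that
byte-swaps in flight, i.e. a one-instruction ntohl()/htonl(); reg-to-reg
forms don't exist, hence the "m" constraints):

	#include <stdint.h>

	static inline uint32_t load_be32(const uint32_t *p)
	{
		uint32_t v;

		/* replaces: mov (%rdi),%eax; bswap %eax */
		asm ("movbe %1, %0" : "=r" (v) : "m" (*p));
		return v;
	}

	static inline void store_be32(uint32_t *p, uint32_t v)
	{
		/* replaces: bswap %eax; mov %eax,(%rdi) */
		asm ("movbe %1, %0" : "=m" (*p) : "r" (v));
	}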
---
 arch/x86/crypto/des3_ede-asm_64.S | 28 ++++++++++++++++++++++++++++
 arch/x86/net/bpf_jit.S            | 12 ++++++++++++
 scripts/kconfig/cpuid.c           |  4 ++++
 scripts/march-native.sh           |  3 ++-
 4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/arch/x86/crypto/des3_ede-asm_64.S b/arch/x86/crypto/des3_ede-asm_64.S
index 8e49ce117494..007319ea1f62 100644
--- a/arch/x86/crypto/des3_ede-asm_64.S
+++ b/arch/x86/crypto/des3_ede-asm_64.S
@@ -159,6 +159,15 @@
 
 #define dummy2(a, b) /*_*/
 
+#ifdef CONFIG_MARCH_NATIVE_MOVBE
+#define read_block(io, left, right) \
+	movbe	 (io), left##d; \
+	movbe	4(io), right##d;
+
+#define write_block(io, left, right) \
+	movbe	left##d,   (io); \
+	movbe	right##d, 4(io);
+#else
 #define read_block(io, left, right) \
 	movl    (io), left##d; \
 	movl   4(io), right##d; \
@@ -170,6 +179,7 @@
 	bswapl right##d; \
 	movl   left##d,   (io); \
 	movl   right##d, 4(io);
+#endif
 
 ENTRY(des3_ede_x86_64_crypt_blk)
 	/* input:
@@ -443,6 +453,14 @@ ENTRY(des3_ede_x86_64_crypt_blk_3way)
 	pushq %rsi /* dst */
 
 	/* load input */
+#ifdef CONFIG_MARCH_NATIVE_MOVBE
+	movbe 0 * 4(%rdx), RL0d;
+	movbe 1 * 4(%rdx), RR0d;
+	movbe 2 * 4(%rdx), RL1d;
+	movbe 3 * 4(%rdx), RR1d;
+	movbe 4 * 4(%rdx), RL2d;
+	movbe 5 * 4(%rdx), RR2d;
+#else
 	movl 0 * 4(%rdx), RL0d;
 	movl 1 * 4(%rdx), RR0d;
 	movl 2 * 4(%rdx), RL1d;
@@ -456,6 +474,7 @@ ENTRY(des3_ede_x86_64_crypt_blk_3way)
 	bswapl RR1d;
 	bswapl RL2d;
 	bswapl RR2d;
+#endif
 
 	initial_permutation3(RL, RR);
 
@@ -516,6 +535,14 @@ ENTRY(des3_ede_x86_64_crypt_blk_3way)
 
 	final_permutation3(RR, RL);
 
+#ifdef CONFIG_MARCH_NATIVE_MOVBE
+	movbe RR0d, 0 * 4(%rsi);
+	movbe RL0d, 1 * 4(%rsi);
+	movbe RR1d, 2 * 4(%rsi);
+	movbe RL1d, 3 * 4(%rsi);
+	movbe RR2d, 4 * 4(%rsi);
+	movbe RL2d, 5 * 4(%rsi);
+#else
 	bswapl RR0d;
 	bswapl RL0d;
 	bswapl RR1d;
@@ -530,6 +557,7 @@ ENTRY(des3_ede_x86_64_crypt_blk_3way)
 	movl RL1d, 3 * 4(%rsi);
 	movl RR2d, 4 * 4(%rsi);
 	movl RL2d, 5 * 4(%rsi);
+#endif
 
 	popq %r15;
 	popq %r14;
diff --git a/arch/x86/net/bpf_jit.S b/arch/x86/net/bpf_jit.S
index b33093f84528..17fe33750298 100644
--- a/arch/x86/net/bpf_jit.S
+++ b/arch/x86/net/bpf_jit.S
@@ -34,8 +34,12 @@ FUNC(sk_load_word_positive_offset)
 	sub	%esi,%eax		# hlen - offset
 	cmp	$3,%eax
 	jle	bpf_slow_path_word
+#ifdef CONFIG_MARCH_NATIVE_MOVBE
+	movbe	(SKBDATA,%rsi),%eax
+#else
 	mov     (SKBDATA,%rsi),%eax
 	bswap   %eax  			/* ntohl() */
+#endif
 	ret
 
 FUNC(sk_load_half)
@@ -80,8 +84,12 @@ FUNC(sk_load_byte_positive_offset)
 bpf_slow_path_word:
 	bpf_slow_path_common(4)
 	js	bpf_error
+#ifdef CONFIG_MARCH_NATIVE_MOVBE
+	movbe	32(%rbp),%eax
+#else
 	mov	32(%rbp),%eax
 	bswap	%eax
+#endif
 	ret
 
 bpf_slow_path_half:
@@ -118,8 +126,12 @@ bpf_slow_path_word_neg:
 
 FUNC(sk_load_word_negative_offset)
 	sk_negative_common(4)
+#ifdef CONFIG_MARCH_NATIVE_MOVBE
+	movbe	(%rax), %eax
+#else
 	mov	(%rax), %eax
 	bswap	%eax
+#endif
 	ret
 
 bpf_slow_path_half_neg:
diff --git a/scripts/kconfig/cpuid.c b/scripts/kconfig/cpuid.c
index ecb285183581..2c23c8699ae6 100644
--- a/scripts/kconfig/cpuid.c
+++ b/scripts/kconfig/cpuid.c
@@ -42,6 +42,7 @@ static inline void cpuid2(uint32_t eax0, uint32_t ecx0, uint32_t *eax, uint32_t
 	);
 }
 
+static bool movbe	= false;
 static bool popcnt	= false;
 static bool rep_movsb	= false;
 static bool rep_stosb	= false;
@@ -56,6 +57,8 @@ static void intel(void)
 		cpuid(1, &eax, &ecx, &edx, &ebx);
 //		printf("%08x %08x %08x %08x\n", eax, ecx, edx, ebx);
 
+		if (ecx & (1 << 22))
+			movbe = true;
 		if (ecx & (1 << 23))
 			popcnt = true;
 	}
@@ -86,6 +89,7 @@ int main(int argc, char *argv[])
 		intel();
 
 #define _(x)	if (streq(opt, #x)) return x ? EXIT_SUCCESS : EXIT_FAILURE
+	_(movbe);
 	_(popcnt);
 	_(rep_movsb);
 	_(rep_stosb);
diff --git a/scripts/march-native.sh b/scripts/march-native.sh
index d3adf0edb2be..93f6a9bd4a6c 100755
--- a/scripts/march-native.sh
+++ b/scripts/march-native.sh
@@ -30,6 +30,7 @@ option() {
 }
 
 if test -x "$CPUID"; then
+	"$CPUID" movbe		&& option "CONFIG_MARCH_NATIVE_MOVBE"
 	"$CPUID" popcnt		&& option "CONFIG_MARCH_NATIVE_POPCNT"
 	"$CPUID" rep_movsb	&& option "CONFIG_MARCH_NATIVE_REP_MOVSB"
 	"$CPUID" rep_stosb	&& option "CONFIG_MARCH_NATIVE_REP_STOSB"
@@ -76,7 +77,7 @@ for i in $COLLECT_GCC_OPTIONS; do
 		-mhle)		option "CONFIG_MARCH_NATIVE_HLE"	;;
 		-mlzcnt)	option "CONFIG_MARCH_NATIVE_LZCNT"	;;
 		-mmmx)		option "CONFIG_MARCH_NATIVE_MMX"	;;
-		-mmovbe)	option "CONFIG_MARCH_NATIVE_MOVBE"	;;
+		-mmovbe);;
 		-mpclmul)	option "CONFIG_MARCH_NATIVE_PCLMUL"	;;
 		-mpopcnt);;
 		-mprfchw)	option "CONFIG_MARCH_NATIVE_PREFETCHW"	;;
-- 
2.13.6

* Re: [PATCH 2/5] -march=native: POPCNT support
  2017-12-07 22:41 ` [PATCH 2/5] -march=native: POPCNT support Alexey Dobriyan
@ 2017-12-07 23:07   ` H. Peter Anvin
  2017-12-08 10:09     ` Alexey Dobriyan
  0 siblings, 1 reply; 10+ messages in thread
From: H. Peter Anvin @ 2017-12-07 23:07 UTC (permalink / raw)
  To: Alexey Dobriyan, linux-kernel; +Cc: x86, tglx, mingo

On 12/07/17 14:41, Alexey Dobriyan wrote:
> The mainline kernel can only generate the "popcnt rax, rdi" instruction,
> via an alternative masquerading as a function call. This patch allows
> generating all POPCNT variations and inlines the hweight*() family of
> functions.
> 
> 	$ objdump  -dr ../obj/vmlinux | grep popcnt
> 	ffffffff81004f6d:       f3 48 0f b8 c9          popcnt rcx,rcx
> 	ffffffff81008484:       f3 48 0f b8 03          popcnt rax,QWORD PTR [rbx]
> 	ffffffff81073aae:       f3 48 0f b8 d8          popcnt rbx,rax
> 		...
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
> ---
>  arch/x86/include/asm/arch_hweight.h | 32 ++++++++++++++++++++++++++++++--
>  arch/x86/lib/Makefile               |  5 ++++-
>  include/linux/bitops.h              |  2 ++
>  lib/Makefile                        |  2 ++
>  scripts/kconfig/cpuid.c             |  6 ++++++
>  scripts/march-native.sh             |  6 +++++-
>  6 files changed, 49 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
> index 34a10b2d5b73..58e4f65d8665 100644
> --- a/arch/x86/include/asm/arch_hweight.h
> +++ b/arch/x86/include/asm/arch_hweight.h
> @@ -2,6 +2,34 @@
>  #ifndef _ASM_X86_HWEIGHT_H
>  #define _ASM_X86_HWEIGHT_H
>  
> +#define __HAVE_ARCH_SW_HWEIGHT
> +
> +#ifdef CONFIG_MARCH_NATIVE_POPCNT
> +static inline unsigned int __arch_hweight64(uint64_t x)
> +{
> +	uint64_t rv;
> +	asm ("popcnt %1, %0" : "=r" (rv) : "rm" (x));
> +	return rv;
> +}
> +
> +static inline unsigned int __arch_hweight32(uint32_t x)
> +{
> +	uint32_t rv;
> +	asm ("popcnt %1, %0" : "=r" (rv) : "rm" (x));
> +	return rv;
> +}
> +
> +static inline unsigned int __arch_hweight16(uint16_t x)
> +{
> +	return __arch_hweight32(x);
> +}
> +
> +static inline unsigned int __arch_hweight8(uint8_t x)
> +{
> +	return __arch_hweight32(x);
> +}


-march=native really would be better implemented by examining the macros
generated by gcc which correspond to the selected -m options
(-march=native really just selects a combination of -m options.)  It
seems bizarre to just reimplement the mechanism that already exists.

Now, this specific case would be better done with alternatives; we can
patch in a JMP to an out-of-line stub to mangle the arguments.  Then you
get the benefit on all systems and don't need to decide at compile time.

The reason to use -m options for this would be to be able to use the
__builtin_popcount() and __builtin_popcountl() intrinsics, which would
allow gcc to schedule it and optimize arbitrarily.

So, much more something like:

#ifdef __POPCNT__

static inline unsigned int __arch_hweight64(uint64_t x)
{
	return __builtin_popcountll(x);
}

static inline unsigned int __arch_hweight32(uint32_t x)
{
	return __builtin_popcount(x);
}

#else

/* Assembly code with alternatives */

	/* Enabled alternative */
	popcnt %1, %0

	/* Non-enabled alternative */
	jmp 1f
2:
	.pushsection .altinstr_aux
1:
	pushq %q1		/* pushl %k1 for i386 */
	call ___arch_hweight%z1
	popq %q0		/* popl %k0 for i386 */
	jmp 2b
	.popsection

#endif


The ___arch_hweight[bwlq] functions would have to be written in assembly
with all registers preserved.  The input and output share a common word
on the stack -- 8(%rsp) or 4(%esp) for x86-64 vs. i386.

	-hpa

* Re: [PATCH v0 1/5] x86_64: march=native support
  2017-12-07 22:41 [PATCH v0 1/5] x86_64: march=native support Alexey Dobriyan
                   ` (3 preceding siblings ...)
  2017-12-07 22:41 ` [PATCH 5/5] -march=native: MOVBE support Alexey Dobriyan
@ 2017-12-07 23:32 ` H. Peter Anvin
  2017-12-08  9:57   ` Alexey Dobriyan
  4 siblings, 1 reply; 10+ messages in thread
From: H. Peter Anvin @ 2017-12-07 23:32 UTC (permalink / raw)
  To: Alexey Dobriyan, linux-kernel; +Cc: x86, tglx, mingo

One more thing: you HAVE to make
arch/x86/include/asm/required-features.h aware of any features that the
kernel unconditionally depends on.

Again, using the gcc cpp macros that reflect what bits gcc itself
broadcasts.  However, this is perhaps where CONFIG flags become
important, since required-features.h has to be able to be compiled in
the bootcode environment, which is different from the normal kernel
compiler environment.

We could, however, automagically generate a reflection of these as a
header file:

	echo '#ifndef __LINUX_CC_DEFINES__'
	echo '#define __LINUX_CC_DEFINES__'
	$(CC) $(c_flags) -x c -E -Wp,-dM /dev/null | sort | \
	sed -nr -e 's/^#define __([^[:space:]]+)__$/#define __KERNEL_CC_\1/p'
	echo '#endif'

	-hpa

* Re: [PATCH v0 1/5] x86_64: march=native support
  2017-12-07 23:32 ` [PATCH v0 1/5] x86_64: march=native support H. Peter Anvin
@ 2017-12-08  9:57   ` Alexey Dobriyan
  0 siblings, 0 replies; 10+ messages in thread
From: Alexey Dobriyan @ 2017-12-08  9:57 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel, x86, tglx, mingo

On 12/8/17, H. Peter Anvin <hpa@zytor.com> wrote:
> One more thing: you HAVE to make
> arch/x86/include/asm/required-features.h aware of any features that the
> kernel unconditionally depend on.

Yes, this is the foolproof part I have to think through.

> Again, using the gcc cpp macros that reflect what bits gcc itself
> broadcast.  However, this is perhaps where CONFIG flags become
> important, since required-features.h has to be able to be compiled in
> the bootcode environment, which is different from the normal kernel
> compiler environment.
>
> We could, however, automagically generate a reflection of these as a
> header file:
>
> 	echo '#ifndef __LINUX_CC_DEFINES__'
> 	echo '#define __LINUX_CC_DEFINES__'
> 	$(CC) $(c_flags) -x c -E -Wp,-dM /dev/null | sort | \
> 	sed -nr -e 's/^#define __([^[:space]]+)__$/#define __KERNEL_CC_\1/p'
> 	echo '#endif

A lot of them aren't interesting and duplicate each other.
Another thing: clang. It detects the machine I'm typing this on as
__corei7__ while gcc detects it as __core_avx2__.

* Re: [PATCH 2/5] -march=native: POPCNT support
  2017-12-07 23:07   ` H. Peter Anvin
@ 2017-12-08 10:09     ` Alexey Dobriyan
  0 siblings, 0 replies; 10+ messages in thread
From: Alexey Dobriyan @ 2017-12-08 10:09 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel, x86, tglx, mingo

On 12/8/17, H. Peter Anvin <hpa@zytor.com> wrote:
> On 12/07/17 14:41, Alexey Dobriyan wrote:
>> The mainline kernel can only generate the "popcnt rax, rdi" instruction,
>> via an alternative masquerading as a function call. This patch allows
>> generating all POPCNT variations and inlines the hweight*() family of
>> functions.

> -march=native really would be better implemented by examining the macros
> generated by gcc which correspond to the selected -m options
> (-march=native really just selects a combination of -m options.)  It
> seems bizarre to just reimplement the mechanism that already exists.

This mechanism can only do the feature detection part.

Code generation tweaks (uop fusion etc.) require passing -march=native
and can hardly be expressed through defines.

Some things aren't recorded in defines at all (--param l1-cache-size).
It is not clear yet how and what to optimize based on cache sizes, but
if the kernel controls the CPU detection code, there is no need to wait
until someone smart comes up with an idea.

Again, clang emits slightly different defines.
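
Concretely, a sketch of the divergence (macro names taken from the box
above, nothing more authoritative than that):

	/* The same CPU is advertised under different arch macro names:
	 *
	 *   gcc   -march=native ...	predefines __core_avx2__
	 *   clang -march=native ...	predefines __corei7__
	 *
	 * so uarch checks via defines end up compiler-specific. */
	#if defined(__core_avx2__) || defined(__corei7__)
	/* tune for this microarchitecture */
	#endif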

> Now, this specific case would be better done with alternatives; we can
> patch in a JMP to an out-of-line stub to mangle the arguments.  Then you
> get the benefit on all systems and don't need to decide at compile time.
>
> The reason to use -m options for this would be to be able to use the
> __builtin_popcount() and __builtin_popcountl() intrinsics, which would
> allow gcc to schedule it and optimize arbitrarily.
>
> So, much more something like:
>
> #ifdef __POPCNT__
>
> static inline unsigned int __arch_hweight64(uint64_t x)
> {
> 	return __builtin_popcountll(x);

OK

* Re: [PATCH 4/5] -march=native: REP STOSB
  2017-12-07 22:41 ` [PATCH 4/5] -march=native: REP STOSB Alexey Dobriyan
@ 2017-12-08 19:08   ` Andi Kleen
  0 siblings, 0 replies; 10+ messages in thread
From: Andi Kleen @ 2017-12-08 19:08 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: linux-kernel, x86, tglx, mingo, hpa

Alexey Dobriyan <adobriyan@gmail.com> writes:
>  
> +#ifdef CONFIG_MARCH_NATIVE_REP_STOSB
> +static __always_inline void clear_page(void *page)
> +{
> +	uint32_t len = PAGE_SIZE;
> +	asm volatile (
> +		"rep stosb"
> +		: "+D" (page), "+c" (len)
> +		: "a" (0)
> +		: "memory"
> +	);
> +}

clear_page_64 already patches in the equivalent code sequence;
it's clear_page_erms().

It's very doubtful that this patch is worth it.

-Andi
