From: Douglas Anderson <dianders@chromium.org>
To: Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>
Cc: Jackie Liu <liuyun01@kylinos.cn>,
	Ard Biesheuvel <ard.biesheuvel@linaro.org>,
	Douglas Anderson <dianders@chromium.org>,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest
Date: Mon, 21 Sep 2020 17:26:08 -0700
Message-ID: <20200921172603.1.Id9450c1d3deef17718bd5368580a3c44895209ee@changeid>

On every boot we see messages like this:

[    0.025360] calling  calibrate_xor_blocks+0x0/0x134 @ 1
[    0.025363] xor: measuring software checksum speed
[    0.035351]    8regs     :  3952.000 MB/sec
[    0.045384]    32regs    :  4860.000 MB/sec
[    0.055418]    arm64_neon:  5900.000 MB/sec
[    0.055423] xor: using function: arm64_neon (5900.000 MB/sec)
[    0.055433] initcall calibrate_xor_blocks+0x0/0x134 returned 0 after 29296 usecs

As you can see, we spend about 30 ms on every boot (roughly 10 ms per
candidate) re-confirming that, yet again, the arm64_neon implementation
is the fastest way to do XOR.  And that is on a system with HZ=1000.
Because the test is timed in jiffies, a lower HZ makes it take
proportionally longer: HZ=100 means we spend about 300 ms on every boot
re-confirming a fact that will be the same for every bootup.
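
To make the HZ scaling concrete, here is a rough userspace sketch of
the arithmetic.  The 10-jiffies-per-candidate figure is an assumption
read off the timestamps in the log above, and the function name is
made up for illustration; this is not code from the kernel.

```c
#include <assert.h>

/*
 * Assumption: each candidate is timed for about 10 jiffies. This is
 * inferred from the ~10 ms gaps in the HZ=1000 log, not taken from
 * the kernel source.
 */
#define JIFFIES_PER_TEMPLATE 10

/* Estimated boot time spent benchmarking, in milliseconds. */
static unsigned int xor_calibration_ms(unsigned int hz,
				       unsigned int nr_templates)
{
	assert(hz > 0);
	/* One jiffy lasts 1000 / hz milliseconds. */
	return nr_templates * JIFFIES_PER_TEMPLATE * 1000 / hz;
}
```

With three candidates this gives 30 ms for HZ=1000 and 300 ms for
HZ=100, matching the numbers quoted above.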

Trying to super-optimize the xor operation makes a lot of sense if
you're using software RAID, but the above is probably not worth it for
most Linux users because:
1. Quite a few arm64 kernels are built for embedded systems where
   software RAID isn't common.  That means we're spending lots of time
   on every boot trying to optimize something we don't use.
2. Presumably, if we have NEON, it's faster than the alternatives.
   Even if it isn't, it's not expected to be much slower.
3. Quite a lot of arm64 systems are big.LITTLE.  This means that the
   existing test is somewhat misguided because it assumes that test
   results on the boot CPU apply to the other CPUs in the system.
   This is not necessarily the case.

Let's add a new config option that allows us to just use the NEON
functions (if present) without benchmarking.
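
As a simplified userspace model of the intended behavior (all names
below are hypothetical stand-ins, and the "benchmark" just returns a
fixed winner; this is not the kernel's actual selection code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct xor_block_template. */
struct xor_template {
	const char *name;
};

static struct xor_template tmpl_neon = { .name = "arm64_neon" };
static struct xor_template tmpl_32regs = { .name = "32regs" };

/* Counts how often the modelled benchmark ran. */
static int benchmark_runs;

/*
 * Stand-in for the timed loop. It does no timing; it simply assumes
 * NEON wins when it is among the candidates, as in the log above.
 */
static struct xor_template *benchmark_all(bool has_neon)
{
	benchmark_runs++;
	return has_neon ? &tmpl_neon : &tmpl_32regs;
}

/*
 * Models the new option: with it set and NEON present, the template
 * is chosen up front and the benchmark is skipped entirely.
 */
static struct xor_template *select_xor_template(bool force_neon,
						bool has_neon)
{
	struct xor_template *chosen =
		(force_neon && has_neon) ? &tmpl_neon : NULL;

	if (!chosen)
		chosen = benchmark_all(has_neon);

	assert(chosen != NULL);
	return chosen;
}
```

In this model select_xor_template(true, true) returns the NEON template
without touching benchmark_runs, while either argument being false
falls back to the benchmark.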

NOTE: One small side effect is that, with this option enabled, an arm64
system _without_ NEON will end up also testing the xor_block_8regs_p
and xor_block_32regs_p versions of the function.  That's presumably OK
since we already test all of those when KERNEL_MODE_NEON is disabled.

ALSO NOTE: presumably the way to do better than this is to add some
sort of per-CPU-core lookup table and jump to a per-CPU-core-specific
XOR function each time xor is called.  Without seeing evidence that
this would really help someone, though, that doesn't seem worth it.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 arch/arm64/Kconfig           | 15 +++++++++++++++
 arch/arm64/include/asm/xor.h |  5 +++++
 2 files changed, 20 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 64ae5e4eb814..fc18df45a5f8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -306,6 +306,21 @@ config SMP
 config KERNEL_MODE_NEON
 	def_bool y
 
+config FORCE_NEON_XOR_IF_AVAILABLE
+	bool "Assume neon is fastest for xor if the CPU supports it"
+	default y
+	depends on KERNEL_MODE_NEON
+	help
+	  Normally the kernel will run through several different XOR
+	  algorithms at boot, timing them on the boot processor to see
+	  which is fastest. This can take quite some time. On many
+	  machines it's expected that, if NEON is available, it's going
+	  to provide the fastest implementation. If you set this option
+	  we'll skip testing this every boot and just assume NEON is the
+	  fastest if present. Setting this option will speed up your
+	  boot but you might end up with a less-optimal xor
+	  implementation.
+
 config FIX_EARLYCON_MEM
 	def_bool y
 
diff --git a/arch/arm64/include/asm/xor.h b/arch/arm64/include/asm/xor.h
index 947f6a4f1aa0..1acb290866ab 100644
--- a/arch/arm64/include/asm/xor.h
+++ b/arch/arm64/include/asm/xor.h
@@ -57,6 +57,10 @@ static struct xor_block_template xor_block_arm64 = {
 	.do_4   = xor_neon_4,
 	.do_5	= xor_neon_5
 };
+#ifdef CONFIG_FORCE_NEON_XOR_IF_AVAILABLE
+#define XOR_SELECT_TEMPLATE(FASTEST) \
+	(cpu_has_neon() ? &xor_block_arm64 : FASTEST)
+#else /* ! CONFIG_FORCE_NEON_XOR_IF_AVAILABLE */
 #undef XOR_TRY_TEMPLATES
 #define XOR_TRY_TEMPLATES           \
 	do {        \
@@ -66,5 +70,6 @@ static struct xor_block_template xor_block_arm64 = {
 			xor_speed(&xor_block_arm64);\
 		} \
 	} while (0)
+#endif /* ! CONFIG_FORCE_NEON_XOR_IF_AVAILABLE */
 
 #endif /* ! CONFIG_KERNEL_MODE_NEON */
-- 
2.28.0.681.g6f77f65b4e-goog

