linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest
@ 2020-09-22  0:26 Douglas Anderson
  2020-09-22  6:25 ` Ard Biesheuvel
  2020-09-22  8:25 ` David Laight
  0 siblings, 2 replies; 7+ messages in thread
From: Douglas Anderson @ 2020-09-22  0:26 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: linux-kernel, Jackie Liu, Douglas Anderson, linux-arm-kernel,
	Ard Biesheuvel

On every boot we see messages like this:

[    0.025360] calling  calibrate_xor_blocks+0x0/0x134 @ 1
[    0.025363] xor: measuring software checksum speed
[    0.035351]    8regs     :  3952.000 MB/sec
[    0.045384]    32regs    :  4860.000 MB/sec
[    0.055418]    arm64_neon:  5900.000 MB/sec
[    0.055423] xor: using function: arm64_neon (5900.000 MB/sec)
[    0.055433] initcall calibrate_xor_blocks+0x0/0x134 returned 0 after 29296 usecs

As you can see, we spend 30 ms on every boot re-confirming that, yet
again, the arm64_neon implementation is the fastest way to do XOR.
...and the above is on a system with HZ=1000.  Due to the way the
testing happens, if we have HZ defined to something slower it'll take
much longer.  HZ=100 means we spend 300 ms on every boot re-confirming
a fact that will be the same for every bootup.

Trying to super-optimize the xor operation makes a lot of sense if
you're using software RAID, but the above is probably not worth it for
most Linux users because:
1. Quite a few arm64 kernels are built for embedded systems where
   software raid isn't common.  That means we're spending lots of time
   on every boot trying to optimize something we don't use.
2. Presumably, if we have neon, it's faster than alternatives.  If
   it's not, it's not expected to be tons slower.
3. Quite a lot of arm64 systems are big.LITTLE.  This means that the
   existing test is somewhat misguided because it's assuming that test
   results on the boot CPU apply to the other CPUs in the system.
   This is not necessarily the case.

Let's add a new config option that allows us to just use the neon
functions (if present) without benchmarking.

NOTE: One small side effect is that on an arm64 system _without_ neon
we'll end up testing the xor_block_8regs_p and xor_block_32regs_p
versions of the function.  That's presumably OK since we already test
all those when KERNEL_MODE_NEON is disabled.
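
For reference, the generic code in crypto/xor.c consumes this hook
roughly as follows (paraphrased from memory rather than quoted
verbatim, so the exact shape may differ a bit):

  /* paraphrase of calibrate_xor_blocks() in crypto/xor.c */
  fastest = XOR_SELECT_TEMPLATE(NULL);
  if (fastest) {
          /* the arch picked for us: skip the benchmark entirely */
          printk(KERN_INFO "xor: automatically using best "
                 "checksumming function   %s\n", fastest->name);
  } else {
          /* no arch override: time every template and keep the best */
          printk(KERN_INFO "xor: measuring software checksum speed\n");
          XOR_TRY_TEMPLATES;
          fastest = template_list;
          for (f = fastest; f; f = f->next)
                  if (f->speed > fastest->speed)
                          fastest = f;
  }
  active_template = fastest;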

ALSO NOTE: presumably the way to do better than this is to add some
sort of per-CPU-core lookup table and jump to a per-CPU-core-specific
XOR function each time xor is called.  Without seeing evidence that
this would really help someone, though, that doesn't seem worth it.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 arch/arm64/Kconfig           | 15 +++++++++++++++
 arch/arm64/include/asm/xor.h |  5 +++++
 2 files changed, 20 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 64ae5e4eb814..fc18df45a5f8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -306,6 +306,21 @@ config SMP
 config KERNEL_MODE_NEON
 	def_bool y
 
+menuconfig FORCE_NEON_XOR_IF_AVAILABLE
+	bool "Assume neon is fastest for xor if the CPU supports it"
+	default y
+	depends on KERNEL_MODE_NEON
+	help
+	  Normally the kernel will run through several different XOR
+	  algorithms at boot, timing them on the boot processor to see
+	  which is fastest. This can take quite some time. On many
+	  machines it's expected that, if NEON is available, it's going
+	  to provide the fastest implementation. If you set this option
+	  we'll skip testing this every boot and just assume NEON is the
+	  fastest if present. Setting this option will speed up your
+	  boot but you might end up with a less-optimal xor
+	  implementation.
+
 config FIX_EARLYCON_MEM
 	def_bool y
 
diff --git a/arch/arm64/include/asm/xor.h b/arch/arm64/include/asm/xor.h
index 947f6a4f1aa0..1acb290866ab 100644
--- a/arch/arm64/include/asm/xor.h
+++ b/arch/arm64/include/asm/xor.h
@@ -57,6 +57,10 @@ static struct xor_block_template xor_block_arm64 = {
 	.do_4   = xor_neon_4,
 	.do_5	= xor_neon_5
 };
+#ifdef CONFIG_FORCE_NEON_XOR_IF_AVAILABLE
+#define XOR_SELECT_TEMPLATE(FASTEST) \
+	(cpu_has_neon() ? &xor_block_arm64 : FASTEST)
+#else /* ! CONFIG_FORCE_NEON_XOR_IF_AVAILABLE */
 #undef XOR_TRY_TEMPLATES
 #define XOR_TRY_TEMPLATES           \
 	do {        \
@@ -66,5 +70,6 @@ static struct xor_block_template xor_block_arm64 = {
 			xor_speed(&xor_block_arm64);\
 		} \
 	} while (0)
+#endif /* ! CONFIG_FORCE_NEON_XOR_IF_AVAILABLE */
 
 #endif /* ! CONFIG_KERNEL_MODE_NEON */
-- 
2.28.0.681.g6f77f65b4e-goog




* Re: [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest
  2020-09-22  0:26 [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest Douglas Anderson
@ 2020-09-22  6:25 ` Ard Biesheuvel
  2020-09-23  0:39   ` Doug Anderson
  2020-09-22  8:25 ` David Laight
  1 sibling, 1 reply; 7+ messages in thread
From: Ard Biesheuvel @ 2020-09-22  6:25 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: Ard Biesheuvel, Catalin Marinas, Jackie Liu,
	Linux Kernel Mailing List, Will Deacon, Linux ARM

On Tue, 22 Sep 2020 at 02:27, Douglas Anderson <dianders@chromium.org> wrote:
>
> On every boot we see messages like this:
>
> [    0.025360] calling  calibrate_xor_blocks+0x0/0x134 @ 1
> [    0.025363] xor: measuring software checksum speed
> [    0.035351]    8regs     :  3952.000 MB/sec
> [    0.045384]    32regs    :  4860.000 MB/sec
> [    0.055418]    arm64_neon:  5900.000 MB/sec
> [    0.055423] xor: using function: arm64_neon (5900.000 MB/sec)
> [    0.055433] initcall calibrate_xor_blocks+0x0/0x134 returned 0 after 29296 usecs
>
> As you can see, we spend 30 ms on every boot re-confirming that, yet
> again, the arm64_neon implementation is the fastest way to do XOR.
> ...and the above is on a system with HZ=1000.  Due to the way the
> testing happens, if we have HZ defined to something slower it'll take
> much longer.  HZ=100 means we spend 300 ms on every boot re-confirming
> a fact that will be the same for every bootup.
>
> Trying to super-optimize the xor operation makes a lot of sense if
> you're using software RAID, but the above is probably not worth it for
> most Linux users because:
> 1. Quite a few arm64 kernels are built for embedded systems where
>    software raid isn't common.  That means we're spending lots of time
>    on every boot trying to optimize something we don't use.
> 2. Presumably, if we have neon, it's faster than alternatives.  If
>    it's not, it's not expected to be tons slower.
> 3. Quite a lot of arm64 systems are big.LITTLE.  This means that the
>    existing test is somewhat misguided because it's assuming that test
>    results on the boot CPU apply to the other CPUs in the system.
>    This is not necessarily the case.
>
> Let's add a new config option that allows us to just use the neon
> functions (if present) without benchmarking.
>
> NOTE: One small side effect is that on an arm64 system _without_ neon
> we'll end up testing the xor_block_8regs_p and xor_block_32regs_p
> versions of the function.  That's presumably OK since we already test
> all those when KERNEL_MODE_NEON is disabled.
>
> ALSO NOTE: presumably the way to do better than this is to add some
> sort of per-CPU-core lookup table and jump to a per-CPU-core-specific
> XOR function each time xor is called.  Without seeing evidence that
> this would really help someone, though, that doesn't seem worth it.
>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>

On the two arm64 machines that I happen to have running right now, I get

SynQuacer (Cortex-A53)

    8regs     :  1917.000 MB/sec
    32regs    :  2270.000 MB/sec
    arm64_neon:  2053.000 MB/sec

ThunderX2

    8regs     : 10170.000 MB/sec
    32regs    : 12051.000 MB/sec
    arm64_neon: 10948.000 MB/sec

so your assertion is not entirely valid.

If the system does not need XOR, it is free not to load the module, so
there is no reason it has to affect the boot time.

What we /can/ do is remove 8regs - arm64 has plenty of registers so I
don't think it will ever be the fastest.



> ---
>
>  arch/arm64/Kconfig           | 15 +++++++++++++++
>  arch/arm64/include/asm/xor.h |  5 +++++
>  2 files changed, 20 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 64ae5e4eb814..fc18df45a5f8 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -306,6 +306,21 @@ config SMP
>  config KERNEL_MODE_NEON
>         def_bool y
>
> +menuconfig FORCE_NEON_XOR_IF_AVAILABLE
> +       bool "Assume neon is fastest for xor if the CPU supports it"
> +       default y
> +       depends on KERNEL_MODE_NEON
> +       help
> +         Normally the kernel will run through several different XOR
> +         algorithms at boot, timing them on the boot processor to see
> +         which is fastest. This can take quite some time. On many
> +         machines it's expected that, if NEON is available, it's going
> +         to provide the fastest implementation. If you set this option
> +         we'll skip testing this every boot and just assume NEON is the
> +         fastest if present. Setting this option will speed up your
> +         boot but you might end up with a less-optimal xor
> +         implementation.
> +
>  config FIX_EARLYCON_MEM
>         def_bool y
>
> diff --git a/arch/arm64/include/asm/xor.h b/arch/arm64/include/asm/xor.h
> index 947f6a4f1aa0..1acb290866ab 100644
> --- a/arch/arm64/include/asm/xor.h
> +++ b/arch/arm64/include/asm/xor.h
> @@ -57,6 +57,10 @@ static struct xor_block_template xor_block_arm64 = {
>         .do_4   = xor_neon_4,
>         .do_5   = xor_neon_5
>  };
> +#ifdef CONFIG_FORCE_NEON_XOR_IF_AVAILABLE
> +#define XOR_SELECT_TEMPLATE(FASTEST) \
> +       (cpu_has_neon() ? &xor_block_arm64 : FASTEST)
> +#else /* ! CONFIG_FORCE_NEON_XOR_IF_AVAILABLE */
>  #undef XOR_TRY_TEMPLATES
>  #define XOR_TRY_TEMPLATES           \
>         do {        \
> @@ -66,5 +70,6 @@ static struct xor_block_template xor_block_arm64 = {
>                         xor_speed(&xor_block_arm64);\
>                 } \
>         } while (0)
> +#endif /* ! CONFIG_FORCE_NEON_XOR_IF_AVAILABLE */
>
>  #endif /* ! CONFIG_KERNEL_MODE_NEON */
> --
> 2.28.0.681.g6f77f65b4e-goog
>
>



* RE: [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest
  2020-09-22  0:26 [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest Douglas Anderson
  2020-09-22  6:25 ` Ard Biesheuvel
@ 2020-09-22  8:25 ` David Laight
  2020-09-22 10:30   ` Ard Biesheuvel
  1 sibling, 1 reply; 7+ messages in thread
From: David Laight @ 2020-09-22  8:25 UTC (permalink / raw)
  To: 'Douglas Anderson', Catalin Marinas, Will Deacon
  Cc: Jackie Liu, linux-kernel, linux-arm-kernel, Ard Biesheuvel

From: Douglas Anderson
> Sent: 22 September 2020 01:26
> 
> On every boot we see messages like this:
> 
> [    0.025360] calling  calibrate_xor_blocks+0x0/0x134 @ 1
> [    0.025363] xor: measuring software checksum speed
> [    0.035351]    8regs     :  3952.000 MB/sec
> [    0.045384]    32regs    :  4860.000 MB/sec
> [    0.055418]    arm64_neon:  5900.000 MB/sec
> [    0.055423] xor: using function: arm64_neon (5900.000 MB/sec)
> [    0.055433] initcall calibrate_xor_blocks+0x0/0x134 returned 0 after 29296 usecs
> 
> As you can see, we spend 30 ms on every boot re-confirming that, yet
> again, the arm64_neon implementation is the fastest way to do XOR.
> ...and the above is on a system with HZ=1000.  Due to the way the
> testing happens, if we have HZ defined to something slower it'll take
> much longer.  HZ=100 means we spend 300 ms on every boot re-confirming
> a fact that will be the same for every bootup.

Can't the code use a TSC (or similar high-res counter) to
see how long it takes to process a short 'hot cache' block?
That wouldn't take long at all.
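
Something along these lines, perhaps (sketch only; note that the arm64
counter behind get_cycles() ticks much more slowly than the CPU clock,
so a single page may be below its resolution and a small iteration
count might still be needed):

	/* sketch: rank a template by timing one warm pass over a page */
	static unsigned long time_one(struct xor_block_template *tmpl,
				      unsigned long *b1, unsigned long *b2)
	{
		cycles_t t0;

		tmpl->do_2(PAGE_SIZE, b1, b2);	/* prime the caches */
		t0 = get_cycles();
		tmpl->do_2(PAGE_SIZE, b1, b2);	/* timed pass */
		return get_cycles() - t0;	/* lower is better */
	}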

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)




* Re: [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest
  2020-09-22  8:25 ` David Laight
@ 2020-09-22 10:30   ` Ard Biesheuvel
  2020-09-23  0:39     ` Doug Anderson
  0 siblings, 1 reply; 7+ messages in thread
From: Ard Biesheuvel @ 2020-09-22 10:30 UTC (permalink / raw)
  To: David Laight
  Cc: Ard Biesheuvel, Catalin Marinas, Jackie Liu, Douglas Anderson,
	linux-kernel, Will Deacon, linux-arm-kernel

On Tue, 22 Sep 2020 at 10:26, David Laight <David.Laight@aculab.com> wrote:
>
> From: Douglas Anderson
> > Sent: 22 September 2020 01:26
> >
> > On every boot time we see messages like this:
> >
> > [    0.025360] calling  calibrate_xor_blocks+0x0/0x134 @ 1
> > [    0.025363] xor: measuring software checksum speed
> > [    0.035351]    8regs     :  3952.000 MB/sec
> > [    0.045384]    32regs    :  4860.000 MB/sec
> > [    0.055418]    arm64_neon:  5900.000 MB/sec
> > [    0.055423] xor: using function: arm64_neon (5900.000 MB/sec)
> > [    0.055433] initcall calibrate_xor_blocks+0x0/0x134 returned 0 after 29296 usecs
> >
> > As you can see, we spend 30 ms on every boot re-confirming that, yet
> > again, the arm64_neon implementation is the fastest way to do XOR.
> > ...and the above is on a system with HZ=1000.  Due to the way the
> > testing happens, if we have HZ defined to something slower it'll take
> > much longer.  HZ=100 means we spend 300 ms on every boot re-confirming
> > a fact that will be the same for every bootup.
>
> Can't the code use a TSC (or similar high-res counter) to
> see how long it takes to process a short 'hot cache' block?
> That wouldn't take long at all.
>

This is generic code that runs from a core_initcall(), so I am not
sure we can easily implement this in a portable way.

Doug: would it help if we deferred this until late_initcall()? We
could take an arbitrary pick from the list at core_initcall() time to
serve early users, and update to the fastest one at a later time.



* Re: [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest
  2020-09-22  6:25 ` Ard Biesheuvel
@ 2020-09-23  0:39   ` Doug Anderson
  0 siblings, 0 replies; 7+ messages in thread
From: Doug Anderson @ 2020-09-23  0:39 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Ard Biesheuvel, Catalin Marinas, Jackie Liu,
	Linux Kernel Mailing List, Will Deacon, Linux ARM

On Mon, Sep 21, 2020 at 11:25 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Tue, 22 Sep 2020 at 02:27, Douglas Anderson <dianders@chromium.org> wrote:
> >
> > On every boot we see messages like this:
> >
> > [    0.025360] calling  calibrate_xor_blocks+0x0/0x134 @ 1
> > [    0.025363] xor: measuring software checksum speed
> > [    0.035351]    8regs     :  3952.000 MB/sec
> > [    0.045384]    32regs    :  4860.000 MB/sec
> > [    0.055418]    arm64_neon:  5900.000 MB/sec
> > [    0.055423] xor: using function: arm64_neon (5900.000 MB/sec)
> > [    0.055433] initcall calibrate_xor_blocks+0x0/0x134 returned 0 after 29296 usecs
> >
> > As you can see, we spend 30 ms on every boot re-confirming that, yet
> > again, the arm64_neon implementation is the fastest way to do XOR.
> > ...and the above is on a system with HZ=1000.  Due to the way the
> > testing happens, if we have HZ defined to something slower it'll take
> > much longer.  HZ=100 means we spend 300 ms on every boot re-confirming
> > a fact that will be the same for every bootup.
> >
> > Trying to super-optimize the xor operation makes a lot of sense if
> > you're using software RAID, but the above is probably not worth it for
> > most Linux users because:
> > 1. Quite a few arm64 kernels are built for embedded systems where
> >    software raid isn't common.  That means we're spending lots of time
> >    on every boot trying to optimize something we don't use.
> > 2. Presumably, if we have neon, it's faster than alternatives.  If
> >    it's not, it's not expected to be tons slower.
> > 3. Quite a lot of arm64 systems are big.LITTLE.  This means that the
> >    existing test is somewhat misguided because it's assuming that test
> >    results on the boot CPU apply to the other CPUs in the system.
> >    This is not necessarily the case.
> >
> > Let's add a new config option that allows us to just use the neon
> > functions (if present) without benchmarking.
> >
> > NOTE: One small side effect is that on an arm64 system _without_ neon
> > we'll end up testing the xor_block_8regs_p and xor_block_32regs_p
> > versions of the function.  That's presumably OK since we already test
> > all those when KERNEL_MODE_NEON is disabled.
> >
> > ALSO NOTE: presumably the way to do better than this is to add some
> > sort of per-CPU-core lookup table and jump to a per-CPU-core-specific
> > XOR function each time xor is called.  Without seeing evidence that
> > this would really help someone, though, that doesn't seem worth it.
> >
> > Signed-off-by: Douglas Anderson <dianders@chromium.org>
>
> On the two arm64 machines that I happen to have running right now, I get
>
> SynQuacer (Cortex-A53)
>
>     8regs     :  1917.000 MB/sec
>     32regs    :  2270.000 MB/sec
>     arm64_neon:  2053.000 MB/sec
>
> ThunderX2
>
>     8regs     : 10170.000 MB/sec
>     32regs    : 12051.000 MB/sec
>     arm64_neon: 10948.000 MB/sec
>
> so your assertion is not entirely valid.

OK, good to know.


> If the system does not need XOR, it is free not to load the module, so
> there is no reason it has to affect the boot time.

The fact that it was run super early somehow made me just assume that
this couldn't be a module, but of course you're right that it can be a
module.  That works for me and saves me my precious boot time.  ;-)

That being said, this'll still bite anyone who wants to build this in
for whatever reason.  I'll respond to your other email with more...



* Re: [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest
  2020-09-22 10:30   ` Ard Biesheuvel
@ 2020-09-23  0:39     ` Doug Anderson
  2020-09-23 10:14       ` Ard Biesheuvel
  0 siblings, 1 reply; 7+ messages in thread
From: Doug Anderson @ 2020-09-23  0:39 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Ard Biesheuvel, Catalin Marinas, Jackie Liu, linux-kernel,
	David Laight, Will Deacon, linux-arm-kernel

Hi,

On Tue, Sep 22, 2020 at 3:30 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Tue, 22 Sep 2020 at 10:26, David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Douglas Anderson
> > > Sent: 22 September 2020 01:26
> > >
> > > On every boot we see messages like this:
> > >
> > > [    0.025360] calling  calibrate_xor_blocks+0x0/0x134 @ 1
> > > [    0.025363] xor: measuring software checksum speed
> > > [    0.035351]    8regs     :  3952.000 MB/sec
> > > [    0.045384]    32regs    :  4860.000 MB/sec
> > > [    0.055418]    arm64_neon:  5900.000 MB/sec
> > > [    0.055423] xor: using function: arm64_neon (5900.000 MB/sec)
> > > [    0.055433] initcall calibrate_xor_blocks+0x0/0x134 returned 0 after 29296 usecs
> > >
> > > As you can see, we spend 30 ms on every boot re-confirming that, yet
> > > again, the arm64_neon implementation is the fastest way to do XOR.
> > > ...and the above is on a system with HZ=1000.  Due to the way the
> > > testing happens, if we have HZ defined to something slower it'll take
> > > much longer.  HZ=100 means we spend 300 ms on every boot re-confirming
> > > a fact that will be the same for every bootup.
> >
> > Can't the code use a TSC (or similar high-res counter) to
> > see how long it takes to process a short 'hot cache' block?
> > That wouldn't take long at all.
> >
>
> This is generic code that runs from a core_initcall(), so I am not
> sure we can easily implement this in a portable way.

If it ran later, presumably you could just use ktime?  That seems like
it'd be a portable enough way?


> Doug: would it help if we deferred this until late_initcall()? We
> could take an arbitrary pick from the list at core_initcall() time to
> serve early users, and update to the fastest one at a later time.

Yeah, I think that'd work OK.  One advantage of it being later would
be that it could run in parallel to other things that were happening
in the system (anyone who enabled async probe on their driver).  Even
better would be if your code itself could run async and not block the
rest of boot.  ;-)  I do like the idea that we could just arbitrarily
pick one implementation until we've calibrated.  I guess we'd want to
figure out how to do this lockless but it shouldn't be too hard to
just check to see if a single pointer is non-NULL and once it becomes
non-NULL then you can use it...  ...or a pointer plus a sentinel if
writing the pointer can't be done atomically...
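
Something like this is what I had in mind (just a sketch; it assumes
active_template stays NULL until calibration has published a result):

	/* reader side, e.g. in xor_blocks() */
	struct xor_block_template *t = READ_ONCE(active_template);

	if (unlikely(!t))
		t = &xor_block_8regs;	/* always-built-in fallback */
	/* ...then dispatch through t->do_2()/do_3()/etc. as today... */

	/* writer side, at the end of calibration */
	smp_store_release(&active_template, fastest);

The 8regs template comes from asm-generic/xor.h, so I think it's
always there to serve as the placeholder.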

It also feels like with the large number of big.LITTLE systems out
there you'd either want a lookup table per core or you'd want to do
calibration per core.
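
For the per-core flavor I'm imagining something like this (completely
hypothetical sketch, names made up):

	static DEFINE_PER_CPU(struct xor_block_template *, xor_tmpl_pcpu);

	/* in the hot path */
	struct xor_block_template *t = this_cpu_read(xor_tmpl_pcpu);

	if (!t)
		t = &xor_block_8regs;	/* this CPU not calibrated yet */

If we migrate right after the read we'd use a slightly suboptimal
implementation for that one call, which seems harmless.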

-Doug



* Re: [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest
  2020-09-23  0:39     ` Doug Anderson
@ 2020-09-23 10:14       ` Ard Biesheuvel
  0 siblings, 0 replies; 7+ messages in thread
From: Ard Biesheuvel @ 2020-09-23 10:14 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Ard Biesheuvel, Catalin Marinas, Jackie Liu, linux-kernel,
	David Laight, Will Deacon, linux-arm-kernel

On Wed, 23 Sep 2020 at 02:39, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Tue, Sep 22, 2020 at 3:30 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Tue, 22 Sep 2020 at 10:26, David Laight <David.Laight@aculab.com> wrote:
> > >
> > > From: Douglas Anderson
> > > > Sent: 22 September 2020 01:26
> > > >
> > > > On every boot we see messages like this:
> > > >
> > > > [    0.025360] calling  calibrate_xor_blocks+0x0/0x134 @ 1
> > > > [    0.025363] xor: measuring software checksum speed
> > > > [    0.035351]    8regs     :  3952.000 MB/sec
> > > > [    0.045384]    32regs    :  4860.000 MB/sec
> > > > [    0.055418]    arm64_neon:  5900.000 MB/sec
> > > > [    0.055423] xor: using function: arm64_neon (5900.000 MB/sec)
> > > > [    0.055433] initcall calibrate_xor_blocks+0x0/0x134 returned 0 after 29296 usecs
> > > >
> > > > As you can see, we spend 30 ms on every boot re-confirming that, yet
> > > > again, the arm64_neon implementation is the fastest way to do XOR.
> > > > ...and the above is on a system with HZ=1000.  Due to the way the
> > > > testing happens, if we have HZ defined to something slower it'll take
> > > > much longer.  HZ=100 means we spend 300 ms on every boot re-confirming
> > > > a fact that will be the same for every bootup.
> > >
> > > Can't the code use a TSC (or similar high-res counter) to
> > > see how long it takes to process a short 'hot cache' block?
> > > That wouldn't take long at all.
> > >
> >
> > This is generic code that runs from a core_initcall(), so I am not
> > sure we can easily implement this in a portable way.
>
> If it ran later, presumably you could just use ktime?  That seems like
> it'd be a portable enough way?
>

That should work, I suppose. It would also let us simply time N
iterations of the benchmark instead of running it as many times as we
can while waiting for a jiffy to elapse.
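
Something along these lines, I mean (sketch, not tested; the iteration
count and the units of ->speed are arbitrary as long as they are
consistent across templates):

	static void do_xor_speed(struct xor_block_template *tmpl,
				 unsigned long *b1, unsigned long *b2)
	{
		ktime_t start;
		u64 ns;
		int i;

		preempt_disable();
		start = ktime_get();
		for (i = 0; i < 800; i++)
			tmpl->do_2(PAGE_SIZE, b1, b2);
		ns = ktime_to_ns(ktime_sub(ktime_get(), start));
		preempt_enable();

		if (!ns)
			ns = 1;
		/* bytes per microsecond, i.e. roughly MB/s */
		tmpl->speed = div64_u64(800ULL * PAGE_SIZE * 1000, ns);
	}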

>
> > Doug: would it help if we deferred this until late_initcall()? We
> > could take an arbitrary pick from the list at core_initcall() time to
> > serve early users, and update to the fastest one at a later time.
>
> Yeah, I think that'd work OK.  One advantage of it being later would
> be that it could run in parallel to other things that were happening
> in the system (anyone who enabled async probe on their driver).  Even
> better would be if your code itself could run async and not block the
> rest of boot.  ;-)

My code? :-)

> I do like the idea that we could just arbitrarily
> pick one implementation until we've calibrated.  I guess we'd want to
> figure out how to do this lockless but it shouldn't be too hard to
> just check to see if a single pointer is non-NULL and once it becomes
> non-NULL then you can use it...  ...or a pointer plus a sentinel if
> writing the pointer can't be done atomically...
>

Surely, any SMP capable architecture that cares about atomicity at
that level can update a function pointer, which is guaranteed to be
the native word size, without tearing?

This should do it afaict:

--- a/crypto/xor.c
+++ b/crypto/xor.c
@@ -21,7 +21,7 @@
 #endif

 /* The xor routines to use.  */
-static struct xor_block_template *active_template;
+static struct xor_block_template *active_template = &xor_block_8regs;

 void
 xor_blocks(unsigned int src_count, unsigned int bytes, void *dest, void **srcs)
@@ -150,6 +150,5 @@ static __exit void xor_exit(void) { }

 MODULE_LICENSE("GPL");

-/* when built-in xor.o must initialize before drivers/md/md.o */
-core_initcall(calibrate_xor_blocks);
+late_initcall(calibrate_xor_blocks);
 module_exit(xor_exit);


> It also feels like with the large number of big.LITTLE systems out
> there you'd either want a lookup table per core or you'd want to do
> calibration per core.
>

I don't think the complexity is worth it, tbh, as there are too many
parameters to consider, although it would be nice if we could run the
benchmark on the best-performing CPU (as that is where the scheduler
will run the code if it is on a sufficiently hot path, and if it is
not, it doesn't really matter).
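
(If we ever wanted to try it, the mechanics are simple enough; the
hard part is deciding which CPU is the "best" one. A hypothetical
sketch, with the CPU choice left as a hand-wave:)

	static void xor_calibrate_workfn(struct work_struct *work)
	{
		/* run the existing benchmark loop and publish active_template */
	}
	static DECLARE_WORK(xor_calibrate_work, xor_calibrate_workfn);

	static int __init xor_calibrate_async(void)
	{
		int cpu = 0;	/* placeholder: needs a real "fastest CPU" heuristic */

		schedule_work_on(cpu, &xor_calibrate_work);
		return 0;
	}
	late_initcall(xor_calibrate_async);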



Thread overview: 7+ messages
2020-09-22  0:26 [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest Douglas Anderson
2020-09-22  6:25 ` Ard Biesheuvel
2020-09-23  0:39   ` Doug Anderson
2020-09-22  8:25 ` David Laight
2020-09-22 10:30   ` Ard Biesheuvel
2020-09-23  0:39     ` Doug Anderson
2020-09-23 10:14       ` Ard Biesheuvel
