[PATCH v2 0/2] crypto: xor - defer and optimize boot time benchmark

* [PATCH v2 0/2] crypto: xor - defer and optimize boot time benchmark
@ 2020-09-26 10:26 Ard Biesheuvel
  2020-09-26 10:26 ` [PATCH v2 1/2] crypto: xor - defer load time benchmark to a later time Ard Biesheuvel
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Ard Biesheuvel @ 2020-09-26 10:26 UTC
  To: linux-crypto; +Cc: herbert, Ard Biesheuvel, Douglas Anderson, David Laight

Doug reports [0] that the XOR boot time benchmark takes more time than
necessary, and runs at a time when there is little room for other
boot time tasks to run concurrently.

Let's fix this by #1 deferring the benchmark, and #2 using a faster
implementation.

Changes since v1:
- incorporate Doug's review feedback re coarse clocks and the use of pr_info
- add Doug's ack to #1

[0] https://lore.kernel.org/linux-arm-kernel/20200921172603.1.Id9450c1d3deef17718bd5368580a3c44895209ee@changeid/

Cc: Douglas Anderson <dianders@chromium.org>
Cc: David Laight <David.Laight@aculab.com>

Ard Biesheuvel (2):
  crypto: xor - defer load time benchmark to a later time
  crypto: xor - use ktime for template benchmarking

 crypto/xor.c | 67 +++++++++++++-------
 1 file changed, 44 insertions(+), 23 deletions(-)

-- 
2.17.1


* [PATCH v2 1/2] crypto: xor - defer load time benchmark to a later time
  2020-09-26 10:26 [PATCH v2 0/2] crypto: xor - defer and optimize boot time benchmark Ard Biesheuvel
@ 2020-09-26 10:26 ` Ard Biesheuvel
  2020-09-26 10:26 ` [PATCH v2 2/2] crypto: xor - use ktime for template benchmarking Ard Biesheuvel
  2020-10-02 11:55 ` [PATCH v2 0/2] crypto: xor - defer and optimize boot time benchmark Herbert Xu
  2 siblings, 0 replies; 5+ messages in thread
From: Ard Biesheuvel @ 2020-09-26 10:26 UTC
  To: linux-crypto; +Cc: herbert, Ard Biesheuvel, Douglas Anderson, David Laight

Currently, the XOR module performs its boot time benchmark at core
initcall time when it is built-in, to ensure that the RAID code can
make use of it when it is built-in as well.

Let's defer this to a later stage during the boot, to avoid impacting
the overall boot time of the system. Instead, just pick an arbitrary
implementation from the list, and use that as the preliminary default.
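
The two-stage idea described above can be sketched in plain user-space C. This is only an illustration under stated assumptions: `struct tmpl`, `tmpl_register()`, `pick_preliminary()` and `pick_fastest()` are hypothetical names that stand in for the kernel's xor_block_template handling, not code from this patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct xor_block_template. */
struct tmpl {
	struct tmpl *next;
	const char *name;
	int speed;	/* MB/s, filled in later by the deferred benchmark */
};

static struct tmpl *tmpl_list;	/* all registered templates */
static struct tmpl *active;	/* the one callers use right now */

/* Early stage: link the template into the list without timing it. */
static void tmpl_register(struct tmpl *t)
{
	t->next = tmpl_list;
	tmpl_list = t;
}

/* Early stage: use the list head as the preliminary default. */
static void pick_preliminary(void)
{
	active = tmpl_list;
}

/* Late stage: speeds are known, so switch to the fastest entry. */
static void pick_fastest(void)
{
	struct tmpl *f, *fastest = tmpl_list;

	for (f = fastest; f; f = f->next)
		if (f->speed > fastest->speed)
			fastest = f;
	active = fastest;
}
```

Because registration pushes onto the head of the list, the preliminary default in this sketch is whichever template was registered last. Callers get a working implementation right away, and the choice is only refined once the benchmark has run.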

Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 crypto/xor.c | 29 +++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/crypto/xor.c b/crypto/xor.c
index ea7349e6ed23..b42c38343733 100644
--- a/crypto/xor.c
+++ b/crypto/xor.c
@@ -54,6 +54,28 @@ EXPORT_SYMBOL(xor_blocks);
 /* Set of all registered templates.  */
 static struct xor_block_template *__initdata template_list;
 
+#ifndef MODULE
+static void __init do_xor_register(struct xor_block_template *tmpl)
+{
+	tmpl->next = template_list;
+	template_list = tmpl;
+}
+
+static int __init register_xor_blocks(void)
+{
+	active_template = XOR_SELECT_TEMPLATE(NULL);
+
+	if (!active_template) {
+#define xor_speed	do_xor_register
+		// register all the templates and pick the first as the default
+		XOR_TRY_TEMPLATES;
+#undef xor_speed
+		active_template = template_list;
+	}
+	return 0;
+}
+#endif
+
 #define BENCH_SIZE (PAGE_SIZE)
 
 static void __init
@@ -129,6 +151,7 @@ calibrate_xor_blocks(void)
 #define xor_speed(templ)	do_xor_speed((templ), b1, b2)
 
 	printk(KERN_INFO "xor: measuring software checksum speed\n");
+	template_list = NULL;
 	XOR_TRY_TEMPLATES;
 	fastest = template_list;
 	for (f = fastest; f; f = f->next)
@@ -150,6 +173,10 @@ static __exit void xor_exit(void) { }
 
 MODULE_LICENSE("GPL");
 
+#ifndef MODULE
 /* when built-in xor.o must initialize before drivers/md/md.o */
-core_initcall(calibrate_xor_blocks);
+core_initcall(register_xor_blocks);
+#endif
+
+module_init(calibrate_xor_blocks);
 module_exit(xor_exit);
-- 
2.17.1


* [PATCH v2 2/2] crypto: xor - use ktime for template benchmarking
  2020-09-26 10:26 [PATCH v2 0/2] crypto: xor - defer and optimize boot time benchmark Ard Biesheuvel
  2020-09-26 10:26 ` [PATCH v2 1/2] crypto: xor - defer load time benchmark to a later time Ard Biesheuvel
@ 2020-09-26 10:26 ` Ard Biesheuvel
  2020-09-28 23:47   ` Doug Anderson
  2020-10-02 11:55 ` [PATCH v2 0/2] crypto: xor - defer and optimize boot time benchmark Herbert Xu
  2 siblings, 1 reply; 5+ messages in thread
From: Ard Biesheuvel @ 2020-09-26 10:26 UTC
  To: linux-crypto; +Cc: herbert, Ard Biesheuvel, Douglas Anderson, David Laight

Currently, we use the jiffies counter as a time source, by staring at
it until a HZ period elapses, and then staring at it again while
performing as many XOR operations as we can until another HZ period
elapses, so that we can calculate the throughput. This takes longer
than necessary, and depends on HZ, which is undesirable, since HZ is
system dependent.

Let's use the ktime interface instead, and use it to time a fixed
number of XOR operations, which can be done much faster, and makes
the time spent depend on the performance level of the system itself,
which is much more reasonable. To ensure that we have the resolution
we need even on systems with 32 kHz time sources, while not spending too
much time in the benchmark on a slow CPU, let's switch to 3 attempts of
800 repetitions each: that way, we will only misidentify algorithms that
perform within 10% of each other as the fastest if they are faster than
10 GB/s to begin with, which is not expected to occur on systems with
such coarse clocks.
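
For readers who want to try the timing scheme outside the kernel, here is a minimal user-space sketch of "time a fixed number of repetitions, keep the fastest of 3 attempts". It rests on several assumptions: C11 `timespec_get()` stands in for ktime_get() (a real benchmark would prefer a monotonic clock), `atomic_signal_fence()` is only a compiler barrier standing in for mb(), and `xor2_bytes()`, `now_ns()` and `bench_mbps()` are hypothetical helper names.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define BENCH_SIZE	4096
#define REPS		800U
#define ATTEMPTS	3

typedef void (*xor2_fn)(size_t bytes, unsigned char *p1,
			const unsigned char *p2);

/* Plain byte-wise XOR, a stand-in for a template's do_2() hook. */
static void xor2_bytes(size_t bytes, unsigned char *p1,
		       const unsigned char *p2)
{
	size_t i;

	for (i = 0; i < bytes; i++)
		p1[i] ^= p2[i];
}

/* Current time in nanoseconds (wall clock, not monotonic). */
static uint64_t now_ns(void)
{
	struct timespec ts;

	timespec_get(&ts, TIME_UTC);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Time ATTEMPTS runs of REPS calls each and derive MB/s from the
 * fastest run. Returns 0 if the clock was too coarse to measure it.
 */
static uint64_t bench_mbps(xor2_fn fn, unsigned char *b1,
			   const unsigned char *b2)
{
	uint64_t min = UINT64_MAX, start, diff;
	unsigned int i, j;

	for (i = 0; i < ATTEMPTS; i++) {
		start = now_ns();
		for (j = 0; j < REPS; j++) {
			/* keep the compiler from merging iterations */
			atomic_signal_fence(memory_order_seq_cst);
			fn(BENCH_SIZE, b1, b2);
			atomic_signal_fence(memory_order_seq_cst);
		}
		diff = now_ns() - start;
		if (diff < min)
			min = diff;
	}
	if (min == 0)
		return 0;

	/* bytes/ns == GB/s, so scale by 1000 to get MB/s */
	return (1000ULL * REPS * BENCH_SIZE) / min;
}
```

As a rough check of the resolution argument: one attempt moves 800 * 4096 = 3,276,800 bytes, which at 10 GB/s (10 bytes/ns) takes about 328 us. A 32 kHz clock ticks roughly every 30.5 us, so such a run spans about 10.7 ticks, and a one-tick error is a little under 10%.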

On ThunderX2, I get the following results:

Before:

  [72625.956765] xor: measuring software checksum speed
  [72625.993104]    8regs     : 10169.000 MB/sec
  [72626.033099]    32regs    : 12050.000 MB/sec
  [72626.073095]    arm64_neon: 11100.000 MB/sec
  [72626.073097] xor: using function: 32regs (12050.000 MB/sec)

After:

  [72599.650216] xor: measuring software checksum speed
  [72599.651188]    8regs           : 10491 MB/sec
  [72599.652006]    32regs          : 12345 MB/sec
  [72599.652871]    arm64_neon      : 11402 MB/sec
  [72599.652873] xor: using function: 32regs (12345 MB/sec)

Link: https://lore.kernel.org/linux-crypto/20200923182230.22715-3-ardb@kernel.org/
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 crypto/xor.c | 38 +++++++++-----------
 1 file changed, 16 insertions(+), 22 deletions(-)

diff --git a/crypto/xor.c b/crypto/xor.c
index b42c38343733..a0badbc03577 100644
--- a/crypto/xor.c
+++ b/crypto/xor.c
@@ -76,49 +76,43 @@ static int __init register_xor_blocks(void)
 }
 #endif
 
-#define BENCH_SIZE (PAGE_SIZE)
+#define BENCH_SIZE	4096
+#define REPS		800U
 
 static void __init
 do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
 {
 	int speed;
-	unsigned long now, j;
-	int i, count, max;
+	int i, j, count;
+	ktime_t min, start, diff;
 
 	tmpl->next = template_list;
 	template_list = tmpl;
 
 	preempt_disable();
 
-	/*
-	 * Count the number of XORs done during a whole jiffy, and use
-	 * this to calculate the speed of checksumming.  We use a 2-page
-	 * allocation to have guaranteed color L1-cache layout.
-	 */
-	max = 0;
-	for (i = 0; i < 5; i++) {
-		j = jiffies;
-		count = 0;
-		while ((now = jiffies) == j)
-			cpu_relax();
-		while (time_before(jiffies, now + 1)) {
+	min = (ktime_t)S64_MAX;
+	for (i = 0; i < 3; i++) {
+		start = ktime_get();
+		for (j = 0; j < REPS; j++) {
 			mb(); /* prevent loop optimzation */
 			tmpl->do_2(BENCH_SIZE, b1, b2);
 			mb();
 			count++;
 			mb();
 		}
-		if (count > max)
-			max = count;
+		diff = ktime_sub(ktime_get(), start);
+		if (diff < min)
+			min = diff;
 	}
 
 	preempt_enable();
 
-	speed = max * (HZ * BENCH_SIZE / 1024);
+	// bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]
+	speed = (1000 * REPS * BENCH_SIZE) / (unsigned int)ktime_to_ns(min);
 	tmpl->speed = speed;
 
-	printk(KERN_INFO "   %-10s: %5d.%03d MB/sec\n", tmpl->name,
-	       speed / 1000, speed % 1000);
+	pr_info("   %-16s: %5d MB/sec\n", tmpl->name, speed);
 }
 
 static int __init
@@ -158,8 +152,8 @@ calibrate_xor_blocks(void)
 		if (f->speed > fastest->speed)
 			fastest = f;
 
-	printk(KERN_INFO "xor: using function: %s (%d.%03d MB/sec)\n",
-	       fastest->name, fastest->speed / 1000, fastest->speed % 1000);
+	pr_info("xor: using function: %s (%d MB/sec)\n",
+	       fastest->name, fastest->speed);
 
 #undef xor_speed
 
-- 
2.17.1


* Re: [PATCH v2 2/2] crypto: xor - use ktime for template benchmarking
  2020-09-26 10:26 ` [PATCH v2 2/2] crypto: xor - use ktime for template benchmarking Ard Biesheuvel
@ 2020-09-28 23:47   ` Doug Anderson
  0 siblings, 0 replies; 5+ messages in thread
From: Doug Anderson @ 2020-09-28 23:47 UTC
  To: Ard Biesheuvel; +Cc: Linux Crypto Mailing List, Herbert Xu, David Laight

Hi,

On Sat, Sep 26, 2020 at 3:27 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> Currently, we use the jiffies counter as a time source, by staring at
> it until a HZ period elapses, and then staring at it again while
> performing as many XOR operations as we can until another HZ period
> elapses, so that we can calculate the throughput. This takes longer
> than necessary, and depends on HZ, which is undesirable, since HZ is
> system dependent.
>
> Let's use the ktime interface instead, and use it to time a fixed
> number of XOR operations, which can be done much faster, and makes
> the time spent depend on the performance level of the system itself,
> which is much more reasonable. To ensure that we have the resolution
> we need even on systems with 32 kHz time sources, while not spending too
> much time in the benchmark on a slow CPU, let's switch to 3 attempts of
> 800 repetitions each: that way, we will only misidentify algorithms that
> perform within 10% of each other as the fastest if they are faster than
> 10 GB/s to begin with, which is not expected to occur on systems with
> such coarse clocks.
>
> On ThunderX2, I get the following results:
>
> Before:
>
>   [72625.956765] xor: measuring software checksum speed
>   [72625.993104]    8regs     : 10169.000 MB/sec
>   [72626.033099]    32regs    : 12050.000 MB/sec
>   [72626.073095]    arm64_neon: 11100.000 MB/sec
>   [72626.073097] xor: using function: 32regs (12050.000 MB/sec)
>
> After:
>
>   [72599.650216] xor: measuring software checksum speed
>   [72599.651188]    8regs           : 10491 MB/sec
>   [72599.652006]    32regs          : 12345 MB/sec
>   [72599.652871]    arm64_neon      : 11402 MB/sec
>   [72599.652873] xor: using function: 32regs (12345 MB/sec)

What are the chances of 12345 coming up?  ;-)

>
> Link: https://lore.kernel.org/linux-crypto/20200923182230.22715-3-ardb@kernel.org/
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  crypto/xor.c | 38 +++++++++-----------
>  1 file changed, 16 insertions(+), 22 deletions(-)

This looks good to me.  Thanks for taking this on!

Reviewed-by: Douglas Anderson <dianders@chromium.org>

* Re: [PATCH v2 0/2] crypto: xor - defer and optimize boot time benchmark
  2020-09-26 10:26 [PATCH v2 0/2] crypto: xor - defer and optimize boot time benchmark Ard Biesheuvel
  2020-09-26 10:26 ` [PATCH v2 1/2] crypto: xor - defer load time benchmark to a later time Ard Biesheuvel
  2020-09-26 10:26 ` [PATCH v2 2/2] crypto: xor - use ktime for template benchmarking Ard Biesheuvel
@ 2020-10-02 11:55 ` Herbert Xu
  2 siblings, 0 replies; 5+ messages in thread
From: Herbert Xu @ 2020-10-02 11:55 UTC
  To: Ard Biesheuvel; +Cc: linux-crypto, Douglas Anderson, David Laight

On Sat, Sep 26, 2020 at 12:26:49PM +0200, Ard Biesheuvel wrote:
> Doug reports [0] that the XOR boot time benchmark takes more time than
> necessary, and runs at a time when there is little room for other
> boot time tasks to run concurrently.
> 
> Let's fix this by #1 deferring the benchmark, and #2 using a faster
> implementation.
> 
> Changes since v1:
> - incorporate Doug's review feedback re coarse clocks and the use of pr_info
> - add Doug's ack to #1
> 
> [0] https://lore.kernel.org/linux-arm-kernel/20200921172603.1.Id9450c1d3deef17718bd5368580a3c44895209ee@changeid/
> 
> Cc: Douglas Anderson <dianders@chromium.org>
> Cc: David Laight <David.Laight@aculab.com>
> 
> Ard Biesheuvel (2):
>   crypto: xor - defer load time benchmark to a later time
>   crypto: xor - use ktime for template benchmarking
> 
>  crypto/xor.c | 67 +++++++++++++-------
>  1 file changed, 44 insertions(+), 23 deletions(-)

All applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
