linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] lz4: Fix kernel decompression speed
@ 2020-08-03 19:40 Nick Terrell
  2020-08-03 21:57 ` Arvind Sankar
  0 siblings, 1 reply; 14+ messages in thread
From: Nick Terrell @ 2020-08-03 19:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Kernel Team, Nick Terrell, Nick Terrell,
	Yann Collet, Gao Xiang, Sven Schmidt, Andrew Morton,
	Greg Kroah-Hartman, Linus Torvalds

From: Nick Terrell <terrelln@fb.com>

This patch replaces all memcpy() calls with LZ4_memcpy() which calls
__builtin_memcpy() so the compiler can inline it.

LZ4 relies heavily on memcpy() with a constant size being inlined. In
x86 and i386 pre-boot environments memcpy() cannot be inlined because
memcpy() doesn't get defined as __builtin_memcpy().

An equivalent patch has been applied upstream so that the next import
won't lose this change [1].

I've measured the kernel decompression speed using QEMU before and after
this patch for the x86_64 and i386 architectures. The speed-up is about
10x as shown below.

Code	Arch	Kernel Size	Time	Speed
v5.8	x86_64	11504832 B	148 ms	 79 MB/s
patch	x86_64	11503872 B	 13 ms	885 MB/s
v5.8	i386	 9621216 B	 91 ms	106 MB/s
patch	i386	 9620224 B	 10 ms	962 MB/s

I also measured the time to decompress the initramfs on x86_64, i386,
and arm. All three show the same decompression speed before and after,
as expected.

[1] https://github.com/lz4/lz4/pull/890

Signed-off-by: Nick Terrell <terrelln@fb.com>
Cc: Yann Collet <yann.collet.73@gmail.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Sven Schmidt <4sschmid@informatik.uni-hamburg.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 lib/lz4/lz4_compress.c   |  4 ++--
 lib/lz4/lz4_decompress.c | 18 +++++++++---------
 lib/lz4/lz4defs.h        | 10 ++++++++++
 lib/lz4/lz4hc_compress.c |  2 +-
 4 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/lib/lz4/lz4_compress.c b/lib/lz4/lz4_compress.c
index cc7b6d4cc7c7..90bb67994688 100644
--- a/lib/lz4/lz4_compress.c
+++ b/lib/lz4/lz4_compress.c
@@ -446,7 +446,7 @@ static FORCE_INLINE int LZ4_compress_generic(
 			*op++ = (BYTE)(lastRun << ML_BITS);
 		}
 
-		memcpy(op, anchor, lastRun);
+		LZ4_memcpy(op, anchor, lastRun);
 
 		op += lastRun;
 	}
@@ -708,7 +708,7 @@ static int LZ4_compress_destSize_generic(
 		} else {
 			*op++ = (BYTE)(lastRunSize<<ML_BITS);
 		}
-		memcpy(op, anchor, lastRunSize);
+		LZ4_memcpy(op, anchor, lastRunSize);
 		op += lastRunSize;
 	}
 
diff --git a/lib/lz4/lz4_decompress.c b/lib/lz4/lz4_decompress.c
index 5371dab6b481..00cb0d0b73e1 100644
--- a/lib/lz4/lz4_decompress.c
+++ b/lib/lz4/lz4_decompress.c
@@ -153,7 +153,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
 		   && likely((endOnInput ? ip < shortiend : 1) &
 			     (op <= shortoend))) {
 			/* Copy the literals */
-			memcpy(op, ip, endOnInput ? 16 : 8);
+			LZ4_memcpy(op, ip, endOnInput ? 16 : 8);
 			op += length; ip += length;
 
 			/*
@@ -172,9 +172,9 @@ static FORCE_INLINE int LZ4_decompress_generic(
 			    (offset >= 8) &&
 			    (dict == withPrefix64k || match >= lowPrefix)) {
 				/* Copy the match. */
-				memcpy(op + 0, match + 0, 8);
-				memcpy(op + 8, match + 8, 8);
-				memcpy(op + 16, match + 16, 2);
+				LZ4_memcpy(op + 0, match + 0, 8);
+				LZ4_memcpy(op + 8, match + 8, 8);
+				LZ4_memcpy(op + 16, match + 16, 2);
 				op += length + MINMATCH;
 				/* Both stages worked, load the next token. */
 				continue;
@@ -263,7 +263,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
 				}
 			}
 
-			memcpy(op, ip, length);
+			LZ4_memcpy(op, ip, length);
 			ip += length;
 			op += length;
 
@@ -350,7 +350,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
 				size_t const copySize = (size_t)(lowPrefix - match);
 				size_t const restSize = length - copySize;
 
-				memcpy(op, dictEnd - copySize, copySize);
+				LZ4_memcpy(op, dictEnd - copySize, copySize);
 				op += copySize;
 				if (restSize > (size_t)(op - lowPrefix)) {
 					/* overlap copy */
@@ -360,7 +360,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
 					while (op < endOfMatch)
 						*op++ = *copyFrom++;
 				} else {
-					memcpy(op, lowPrefix, restSize);
+					LZ4_memcpy(op, lowPrefix, restSize);
 					op += restSize;
 				}
 			}
@@ -386,7 +386,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
 				while (op < copyEnd)
 					*op++ = *match++;
 			} else {
-				memcpy(op, match, mlen);
+				LZ4_memcpy(op, match, mlen);
 			}
 			op = copyEnd;
 			if (op == oend)
@@ -400,7 +400,7 @@ static FORCE_INLINE int LZ4_decompress_generic(
 			op[2] = match[2];
 			op[3] = match[3];
 			match += inc32table[offset];
-			memcpy(op + 4, match, 4);
+			LZ4_memcpy(op + 4, match, 4);
 			match -= dec64table[offset];
 		} else {
 			LZ4_copy8(op, match);
diff --git a/lib/lz4/lz4defs.h b/lib/lz4/lz4defs.h
index 1a7fa9d9170f..c91dd96ef629 100644
--- a/lib/lz4/lz4defs.h
+++ b/lib/lz4/lz4defs.h
@@ -137,6 +137,16 @@ static FORCE_INLINE void LZ4_writeLE16(void *memPtr, U16 value)
 	return put_unaligned_le16(value, memPtr);
 }
 
+/*
+ * LZ4 relies on memcpy with a constant size being inlined. In freestanding
+ * environments, the compiler can't assume the implementation of memcpy() is
+ * standard compliant, so apply its specialized memcpy() inlining logic. When
+ * possible, use __builtin_memcpy() to tell the compiler to analyze memcpy()
+ * as-if it were standard compliant, so it can inline it in freestanding
+ * environments. This is needed when decompressing the Linux Kernel, for example.
+ */
+#define LZ4_memcpy(dst, src, size) __builtin_memcpy(dst, src, size)
+
 static FORCE_INLINE void LZ4_copy8(void *dst, const void *src)
 {
 #if LZ4_ARCH64
diff --git a/lib/lz4/lz4hc_compress.c b/lib/lz4/lz4hc_compress.c
index 1b61d874e337..e7ac8694b797 100644
--- a/lib/lz4/lz4hc_compress.c
+++ b/lib/lz4/lz4hc_compress.c
@@ -570,7 +570,7 @@ static int LZ4HC_compress_generic(
 			*op++ = (BYTE) lastRun;
 		} else
 			*op++ = (BYTE)(lastRun<<ML_BITS);
-		memcpy(op, anchor, iend - anchor);
+		LZ4_memcpy(op, anchor, iend - anchor);
 		op += iend - anchor;
 	}
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] lz4: Fix kernel decompression speed
  2020-08-03 19:40 [PATCH] lz4: Fix kernel decompression speed Nick Terrell
@ 2020-08-03 21:57 ` Arvind Sankar
  2020-08-03 22:55   ` Nick Terrell
  0 siblings, 1 reply; 14+ messages in thread
From: Arvind Sankar @ 2020-08-03 21:57 UTC (permalink / raw)
  To: Nick Terrell
  Cc: Ingo Molnar, linux-kernel, x86, Kernel Team, Nick Terrell,
	Yann Collet, Gao Xiang, Sven Schmidt, Andrew Morton,
	Greg Kroah-Hartman, Linus Torvalds

On Mon, Aug 03, 2020 at 12:40:22PM -0700, Nick Terrell wrote:
> From: Nick Terrell <terrelln@fb.com>
> 
> This patch replaces all memcpy() calls with LZ4_memcpy() which calls
> __builtin_memcpy() so the compiler can inline it.
> 
> LZ4 relies heavily on memcpy() with a constant size being inlined. In
> x86 and i386 pre-boot environments memcpy() cannot be inlined because
> memcpy() doesn't get defined as __builtin_memcpy().
> 
> An equivalent patch has been applied upstream so that the next import
> won't lose this change [1].
> 
> I've measured the kernel decompression speed using QEMU before and after
> this patch for the x86_64 and i386 architectures. The speed-up is about
> 10x as shown below.
> 
> Code	Arch	Kernel Size	Time	Speed
> v5.8	x86_64	11504832 B	148 ms	 79 MB/s
> patch	x86_64	11503872 B	 13 ms	885 MB/s
> v5.8	i386	 9621216 B	 91 ms	106 MB/s
> patch	i386	 9620224 B	 10 ms	962 MB/s
> 
> I also measured the time to decompress the initramfs on x86_64, i386,
> and arm. All three show the same decompression speed before and after,
> as expected.
> 
> [1] https://github.com/lz4/lz4/pull/890
> 

Hi Nick, would you be able to test the below patch's performance to
verify it gives the same speedup? It removes the #undef in misc.c which
causes the decompressors to not use the builtin version. It should be
equivalent to yours except for applying it to all the decompressors.

Thanks.

From 10f8d939fc367e3127e2d72ba099678debcae422 Mon Sep 17 00:00:00 2001
From: Arvind Sankar <nivedita@alum.mit.edu>
Date: Mon, 3 Aug 2020 17:07:37 -0400
Subject: [PATCH] x86/boot/compressed: Use builtin mem functions for decompressor

Since commits

  c041b5ad8640 ("x86, boot: Create a separate string.h file to provide standard string functions")
  fb4cac573ef6 ("x86, boot: Move memcmp() into string.h and string.c")

the decompressor stub has been using the compiler's builtin memcpy,
memset and memcmp functions, _except_ where it would likely have the
largest impact, in the decompression code itself.

Remove the #undef's of memcpy and memset in misc.c so that the
decompressor code also uses the compiler builtins.

The rationale given in the comment doesn't really apply: just because
some functions use the out-of-line version is no reason to not use the
builtin version in the rest.

Replace the comment with an explanation of why memzero and memmove are
being #define'd.

Drop the suggestion to #undef in boot/string.h as well: the out-of-line
versions are not really optimized versions, they're generic code that's
good enough for the preboot environment. The compiler will likely
generate better code for constant-size memcpy/memset/memcmp if it is
allowed to.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
---
 arch/x86/boot/compressed/misc.c | 7 ++-----
 arch/x86/boot/string.h          | 5 +----
 2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 9652d5c2afda..0c74a6e526b6 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -30,12 +30,9 @@
 #define STATIC		static
 
 /*
- * Use normal definitions of mem*() from string.c. There are already
- * included header files which expect a definition of memset() and by
- * the time we define memset macro, it is too late.
+ * Provide definitions of memzero and memmove as some of the decompressors will
+ * try to define their own functions if these are not defined as macros.
  */
-#undef memcpy
-#undef memset
 #define memzero(s, n)	memset((s), 0, (n))
 #define memmove		memmove
 
diff --git a/arch/x86/boot/string.h b/arch/x86/boot/string.h
index 995f7b7ad512..a232da487cd2 100644
--- a/arch/x86/boot/string.h
+++ b/arch/x86/boot/string.h
@@ -11,10 +11,7 @@ void *memcpy(void *dst, const void *src, size_t len);
 void *memset(void *dst, int c, size_t len);
 int memcmp(const void *s1, const void *s2, size_t len);
 
-/*
- * Access builtin version by default. If one needs to use optimized version,
- * do "undef memcpy" in .c file and link against right string.c
- */
+/* Access builtin version by default. */
 #define memcpy(d,s,l) __builtin_memcpy(d,s,l)
 #define memset(d,c,l) __builtin_memset(d,c,l)
 #define memcmp	__builtin_memcmp
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] lz4: Fix kernel decompression speed
  2020-08-03 21:57 ` Arvind Sankar
@ 2020-08-03 22:55   ` Nick Terrell
  2020-08-04  1:56     ` Arvind Sankar
  2020-08-04  8:32     ` [PATCH] lz4: Fix kernel decompression speed Pavel Machek
  0 siblings, 2 replies; 14+ messages in thread
From: Nick Terrell @ 2020-08-03 22:55 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Nick Terrell, Ingo Molnar, Linux Kernel Mailing List, X86 ML,
	Kernel Team, Yann Collet, Gao Xiang, Sven Schmidt, Andrew Morton,
	Greg Kroah-Hartman, Linus Torvalds



> On Aug 3, 2020, at 2:57 PM, Arvind Sankar <nivedita@alum.mit.edu> wrote:
> 
> On Mon, Aug 03, 2020 at 12:40:22PM -0700, Nick Terrell wrote:
>> From: Nick Terrell <terrelln@fb.com>
>> 
>> This patch replaces all memcpy() calls with LZ4_memcpy() which calls
>> __builtin_memcpy() so the compiler can inline it.
>> 
>> LZ4 relies heavily on memcpy() with a constant size being inlined. In
>> x86 and i386 pre-boot environments memcpy() cannot be inlined because
>> memcpy() doesn't get defined as __builtin_memcpy().
>> 
>> An equivalent patch has been applied upstream so that the next import
>> won't lose this change [1].
>> 
>> I've measured the kernel decompression speed using QEMU before and after
>> this patch for the x86_64 and i386 architectures. The speed-up is about
>> 10x as shown below.
>> 
>> Code	Arch	Kernel Size	Time	Speed
>> v5.8	x86_64	11504832 B	148 ms	 79 MB/s
>> patch	x86_64	11503872 B	 13 ms	885 MB/s
>> v5.8	i386	 9621216 B	 91 ms	106 MB/s
>> patch	i386	 9620224 B	 10 ms	962 MB/s
>> 
>> I also measured the time to decompress the initramfs on x86_64, i386,
>> and arm. All three show the same decompression speed before and after,
>> as expected.
>> 
>> [1] https://github.com/lz4/lz4/pull/890
>> 
> 
> Hi Nick, would you be able to test the below patch's performance to
> verify it gives the same speedup? It removes the #undef in misc.c which
> causes the decompressors to not use the builtin version. It should be
> equivalent to yours except for applying it to all the decompressors.
> 
> Thanks.

I will measure it. I would expect it to provide the same speed up. It would be great to fix
the problem for x86/i386 in general.

But, I believe that this is also a problem for ARM, though I have a hard time measuring
because I can’t get pre-boot print statements in QEMU. I will attempt to take a look at the
assembly, because I’m fairly certain that memcpy() isn’t inlined in master.

Even if we fix all the architectures, I would still like to merge the LZ4 patch. It seems like it
is pretty easy to merge a patch that is a boot speed regression, because people aren’t
actively measuring it. So I prefer a layered defense.

Additionally, this is following upstream. At some point LZ4 will be imported from upstream
and get an equivalent of this patch https://github.com/lz4/lz4/pull/890.

Best,
Nick

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] lz4: Fix kernel decompression speed
  2020-08-03 22:55   ` Nick Terrell
@ 2020-08-04  1:56     ` Arvind Sankar
  2020-08-04  2:57       ` Nick Terrell
  2020-08-04 17:59       ` Nick Terrell
  2020-08-04  8:32     ` [PATCH] lz4: Fix kernel decompression speed Pavel Machek
  1 sibling, 2 replies; 14+ messages in thread
From: Arvind Sankar @ 2020-08-04  1:56 UTC (permalink / raw)
  To: Nick Terrell
  Cc: Arvind Sankar, Nick Terrell, Ingo Molnar,
	Linux Kernel Mailing List, X86 ML, Kernel Team, Yann Collet,
	Gao Xiang, Sven Schmidt, Andrew Morton, Greg Kroah-Hartman,
	Linus Torvalds

On Mon, Aug 03, 2020 at 10:55:01PM +0000, Nick Terrell wrote:
> 
> 
> > On Aug 3, 2020, at 2:57 PM, Arvind Sankar <nivedita@alum.mit.edu> wrote:
> > 
> > On Mon, Aug 03, 2020 at 12:40:22PM -0700, Nick Terrell wrote:
> >> From: Nick Terrell <terrelln@fb.com>
> >> 
> >> This patch replaces all memcpy() calls with LZ4_memcpy() which calls
> >> __builtin_memcpy() so the compiler can inline it.
> >> 
> >> LZ4 relies heavily on memcpy() with a constant size being inlined. In
> >> x86 and i386 pre-boot environments memcpy() cannot be inlined because
> >> memcpy() doesn't get defined as __builtin_memcpy().
> >> 
> >> An equivalent patch has been applied upstream so that the next import
> >> won't lose this change [1].
> >> 
> >> I've measured the kernel decompression speed using QEMU before and after
> >> this patch for the x86_64 and i386 architectures. The speed-up is about
> >> 10x as shown below.
> >> 
> >> Code	Arch	Kernel Size	Time	Speed
> >> v5.8	x86_64	11504832 B	148 ms	 79 MB/s
> >> patch	x86_64	11503872 B	 13 ms	885 MB/s
> >> v5.8	i386	 9621216 B	 91 ms	106 MB/s
> >> patch	i386	 9620224 B	 10 ms	962 MB/s
> >> 
> >> I also measured the time to decompress the initramfs on x86_64, i386,
> >> and arm. All three show the same decompression speed before and after,
> >> as expected.
> >> 
> >> [1] https://github.com/lz4/lz4/pull/890
> >> 
> > 
> > Hi Nick, would you be able to test the below patch's performance to
> > verify it gives the same speedup? It removes the #undef in misc.c which
> > causes the decompressors to not use the builtin version. It should be
> > equivalent to yours except for applying it to all the decompressors.
> > 
> > Thanks.
> 
> I will measure it. I would expect it to provide the same speed up. It would be great to fix
> the problem for x86/i386 in general.

Thanks. I tried using RDTSC to get some timings under QEMU, and I get
similar speedup as you have for LZ4, and around 15-20% or so for ZSTD
(on 64-bit) -- I see that ZSTD_copy8 is already using __builtin_memcpy,
but there must be more that can be optimized? There's a couple 1/2-byte
sized copies in huf_decompress.c.

For the rest of the algos there seems to be no difference, likely
because some casual grepping doesn't show any constant-size copies in
the other algorithms.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] lz4: Fix kernel decompression speed
  2020-08-04  1:56     ` Arvind Sankar
@ 2020-08-04  2:57       ` Nick Terrell
  2020-08-04 15:19         ` Arvind Sankar
  2020-08-04 17:59       ` Nick Terrell
  1 sibling, 1 reply; 14+ messages in thread
From: Nick Terrell @ 2020-08-04  2:57 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Nick Terrell, Ingo Molnar, Linux Kernel Mailing List, X86 ML,
	Kernel Team, Yann Collet, Gao Xiang, Sven Schmidt, Andrew Morton,
	Greg Kroah-Hartman, Linus Torvalds



> On Aug 3, 2020, at 6:56 PM, Arvind Sankar <nivedita@alum.mit.edu> wrote:
> 
> On Mon, Aug 03, 2020 at 10:55:01PM +0000, Nick Terrell wrote:
>> 
>> 
>>> On Aug 3, 2020, at 2:57 PM, Arvind Sankar <nivedita@alum.mit.edu> wrote:
>>> 
>>> On Mon, Aug 03, 2020 at 12:40:22PM -0700, Nick Terrell wrote:
>>>> From: Nick Terrell <terrelln@fb.com>
>>>> 
>>>> This patch replaces all memcpy() calls with LZ4_memcpy() which calls
>>>> __builtin_memcpy() so the compiler can inline it.
>>>> 
>>>> LZ4 relies heavily on memcpy() with a constant size being inlined. In
>>>> x86 and i386 pre-boot environments memcpy() cannot be inlined because
>>>> memcpy() doesn't get defined as __builtin_memcpy().
>>>> 
>>>> An equivalent patch has been applied upstream so that the next import
>>>> won't lose this change [1].
>>>> 
>>>> I've measured the kernel decompression speed using QEMU before and after
>>>> this patch for the x86_64 and i386 architectures. The speed-up is about
>>>> 10x as shown below.
>>>> 
>>>> Code	Arch	Kernel Size	Time	Speed
>>>> v5.8	x86_64	11504832 B	148 ms	 79 MB/s
>>>> patch	x86_64	11503872 B	 13 ms	885 MB/s
>>>> v5.8	i386	 9621216 B	 91 ms	106 MB/s
>>>> patch	i386	 9620224 B	 10 ms	962 MB/s
>>>> 
>>>> I also measured the time to decompress the initramfs on x86_64, i386,
>>>> and arm. All three show the same decompression speed before and after,
>>>> as expected.
>>>> 
>>>> [1] https://github.com/lz4/lz4/pull/890
>>>> 
>>> 
>>> Hi Nick, would you be able to test the below patch's performance to
>>> verify it gives the same speedup? It removes the #undef in misc.c which
>>> causes the decompressors to not use the builtin version. It should be
>>> equivalent to yours except for applying it to all the decompressors.
>>> 
>>> Thanks.
>> 
>> I will measure it. I would expect it to provide the same speed up. It would be great to fix
>> the problem for x86/i386 in general.
> 
> Thanks. I tried using RDTSC to get some timings under QEMU, and I get
> similar speedup as you have for LZ4, and around 15-20% or so for ZSTD
> (on 64-bit)

By the way, I was using this script for performance testing [0].

> -- I see that ZSTD_copy8 is already using __builtin_memcpy,
> but there must be more that can be optimized? There's a couple 1/2-byte
> sized copies in huf_decompress.c.

Oh wow, I totally missed that, I guess I stopped looking once performance
was about what I expected, nice find!

I suspect it is mostly the memcpy inside of HUF_decodeSymbolX4(), since
that should be the only hot one [1].

Do you want to put up the patch to fix the memcpy’s in zstd Huffman, or should I?

I will be submitting a patch upstream to migrate all of zstd’s memcpy() calls to
use __builtin_memcpy(), since I plan on updating the version in the kernel to
upstream zstd in the next few months. I was waiting until the compressed kernel
patch set landed, so I didn't distract from it.

[0] https://gist.github.com/terrelln/9bd53321a669f62683c608af8944fbc2
[1] https://github.com/torvalds/linux/blob/master/lib/zstd/huf_decompress.c#L598

Best,
Nick


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] lz4: Fix kernel decompression speed
  2020-08-03 22:55   ` Nick Terrell
  2020-08-04  1:56     ` Arvind Sankar
@ 2020-08-04  8:32     ` Pavel Machek
  2020-08-04 15:16       ` Arvind Sankar
  1 sibling, 1 reply; 14+ messages in thread
From: Pavel Machek @ 2020-08-04  8:32 UTC (permalink / raw)
  To: Nick Terrell
  Cc: Arvind Sankar, Nick Terrell, Ingo Molnar,
	Linux Kernel Mailing List, X86 ML, Kernel Team, Yann Collet,
	Gao Xiang, Sven Schmidt, Andrew Morton, Greg Kroah-Hartman,
	Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 1935 bytes --]

Hi!

> >> I've measured the kernel decompression speed using QEMU before and after
> >> this patch for the x86_64 and i386 architectures. The speed-up is about
> >> 10x as shown below.
> >> 
> >> Code	Arch	Kernel Size	Time	Speed
> >> v5.8	x86_64	11504832 B	148 ms	 79 MB/s
> >> patch	x86_64	11503872 B	 13 ms	885 MB/s
> >> v5.8	i386	 9621216 B	 91 ms	106 MB/s
> >> patch	i386	 9620224 B	 10 ms	962 MB/s
> >> 
> >> I also measured the time to decompress the initramfs on x86_64, i386,
> >> and arm. All three show the same decompression speed before and after,
> >> as expected.
> >> 
> >> [1] https://github.com/lz4/lz4/pull/890
> >> 
> > 
> > Hi Nick, would you be able to test the below patch's performance to
> > verify it gives the same speedup? It removes the #undef in misc.c which
> > causes the decompressors to not use the builtin version. It should be
> > equivalent to yours except for applying it to all the decompressors.
> > 
> > Thanks.
> 
> I will measure it. I would expect it to provide the same speed up. It would be great to fix
> the problem for x86/i386 in general.
> 
> But, I believe that this is also a problem for ARM, though I have a hard time measuring
> because I can’t get pre-boot print statements in QEMU. I will attempt to take a look at the
> assembly, because I’m fairly certain that memcpy() isn’t inlined in master.
> 
> Even if we fix all the architectures, I would still like to merge the LZ4 patch. It seems like it
> is pretty easy to merge a patch that is a boot speed regression, because people aren’t
> actively measuring it. So I prefer a layered defense.


Layered defense against performance-only problem, happening on
emulation-only?

IMO that's a bit of overkill.

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] lz4: Fix kernel decompression speed
  2020-08-04  8:32     ` [PATCH] lz4: Fix kernel decompression speed Pavel Machek
@ 2020-08-04 15:16       ` Arvind Sankar
  2020-08-04 17:18         ` Nick Terrell
  0 siblings, 1 reply; 14+ messages in thread
From: Arvind Sankar @ 2020-08-04 15:16 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Nick Terrell, Arvind Sankar, Nick Terrell, Ingo Molnar,
	Linux Kernel Mailing List, X86 ML, Kernel Team, Yann Collet,
	Gao Xiang, Sven Schmidt, Andrew Morton, Greg Kroah-Hartman,
	Linus Torvalds

On Tue, Aug 04, 2020 at 10:32:36AM +0200, Pavel Machek wrote:
> Hi!
> 
> > >> I've measured the kernel decompression speed using QEMU before and after
> > >> this patch for the x86_64 and i386 architectures. The speed-up is about
> > >> 10x as shown below.
> > >> 
> > >> Code	Arch	Kernel Size	Time	Speed
> > >> v5.8	x86_64	11504832 B	148 ms	 79 MB/s
> > >> patch	x86_64	11503872 B	 13 ms	885 MB/s
> > >> v5.8	i386	 9621216 B	 91 ms	106 MB/s
> > >> patch	i386	 9620224 B	 10 ms	962 MB/s
> > >> 
> > >> I also measured the time to decompress the initramfs on x86_64, i386,
> > >> and arm. All three show the same decompression speed before and after,
> > >> as expected.
> > >> 
> > >> [1] https://github.com/lz4/lz4/pull/890
> > >> 
> > > 
> > > Hi Nick, would you be able to test the below patch's performance to
> > > verify it gives the same speedup? It removes the #undef in misc.c which
> > > causes the decompressors to not use the builtin version. It should be
> > > equivalent to yours except for applying it to all the decompressors.
> > > 
> > > Thanks.
> > 
> > I will measure it. I would expect it to provide the same speed up. It would be great to fix
> > the problem for x86/i386 in general.
> > 
> > But, I believe that this is also a problem for ARM, though I have a hard time measuring
> > because I can’t get pre-boot print statements in QEMU. I will attempt to take a look at the
> > assembly, because I’m fairly certain that memcpy() isn’t inlined in master.
> > 
> > Even if we fix all the architectures, I would still like to merge the LZ4 patch. It seems like it
> > is pretty easy to merge a patch that is a boot speed regression, because people aren’t
> > actively measuring it. So I prefer a layered defense.
> 
> 
> Layered defense against performance-only problem, happening on
> emulation-only?
> 
> IMO that's a bit of overkill.
> 
> Best regards,
> 									Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Why would it be emulation-only? QEMU is just being used for ease of
testing, but the performance impact should be similar on bare metal.

Thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] lz4: Fix kernel decompression speed
  2020-08-04  2:57       ` Nick Terrell
@ 2020-08-04 15:19         ` Arvind Sankar
  0 siblings, 0 replies; 14+ messages in thread
From: Arvind Sankar @ 2020-08-04 15:19 UTC (permalink / raw)
  To: Nick Terrell
  Cc: Arvind Sankar, Nick Terrell, Ingo Molnar,
	Linux Kernel Mailing List, X86 ML, Kernel Team, Yann Collet,
	Gao Xiang, Sven Schmidt, Andrew Morton, Greg Kroah-Hartman,
	Linus Torvalds

On Tue, Aug 04, 2020 at 02:57:50AM +0000, Nick Terrell wrote:
> 
> 
> > On Aug 3, 2020, at 6:56 PM, Arvind Sankar <nivedita@alum.mit.edu> wrote:
> > 
> 
> > -- I see that ZSTD_copy8 is already using __builtin_memcpy,
> > but there must be more that can be optimized? There's a couple 1/2-byte
> > sized copies in huf_decompress.c.
> 
> Oh wow, I totally missed that, I guess I stopped looking once performance
> was about what I expected, nice find!
> 
> I suspect it is mostly the memcpy inside of HUF_decodeSymbolX4(), since
> that should be the only hot one [1].
> 
> Do you want to put up the patch to fix the memcpy’s in zstd Huffman, or should I?
> 
> I will be submitting a patch upstream to migrate all of zstd’s memcpy() calls to
> use __builtin_memcpy(), since I plan on updating the version in the kernel to
> upstream zstd in the next few months. I was waiting until the compressed kernel
> patch set landed, so I didn't distract from it.
> 
> [0] https://gist.github.com/terrelln/9bd53321a669f62683c608af8944fbc2
> [1] https://github.com/torvalds/linux/blob/master/lib/zstd/huf_decompress.c#L598
> 
> Best,
> Nick
> 

It's better if you do the zstd changes I think, as I'm not familiar with
the code at all.

Thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] lz4: Fix kernel decompression speed
  2020-08-04 15:16       ` Arvind Sankar
@ 2020-08-04 17:18         ` Nick Terrell
  0 siblings, 0 replies; 14+ messages in thread
From: Nick Terrell @ 2020-08-04 17:18 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Pavel Machek, Nick Terrell, Ingo Molnar,
	Linux Kernel Mailing List, X86 ML, Kernel Team, Yann Collet,
	Gao Xiang, Sven Schmidt, Andrew Morton, Greg Kroah-Hartman,
	Linus Torvalds



> On Aug 4, 2020, at 8:16 AM, Arvind Sankar <nivedita@alum.mit.edu> wrote:
> 
> On Tue, Aug 04, 2020 at 10:32:36AM +0200, Pavel Machek wrote:
>> Hi!
>> 
>>>>> I've measured the kernel decompression speed using QEMU before and after
>>>>> this patch for the x86_64 and i386 architectures. The speed-up is about
>>>>> 10x as shown below.
>>>>> 
>>>>> Code	Arch	Kernel Size	Time	Speed
>>>>> v5.8	x86_64	11504832 B	148 ms	 79 MB/s
>>>>> patch	x86_64	11503872 B	 13 ms	885 MB/s
>>>>> v5.8	i386	 9621216 B	 91 ms	106 MB/s
>>>>> patch	i386	 9620224 B	 10 ms	962 MB/s
>>>>> 
>>>>> I also measured the time to decompress the initramfs on x86_64, i386,
>>>>> and arm. All three show the same decompression speed before and after,
>>>>> as expected.
>>>>> 
>>>>> [1] https://github.com/lz4/lz4/pull/890
>>>>> 
>>>> 
>>>> Hi Nick, would you be able to test the below patch's performance to
>>>> verify it gives the same speedup? It removes the #undef in misc.c which
>>>> causes the decompressors to not use the builtin version. It should be
>>>> equivalent to yours except for applying it to all the decompressors.
>>>> 
>>>> Thanks.
>>> 
>>> I will measure it. I would expect it to provide the same speed up. It would be great to fix
>>> the problem for x86/i386 in general.
>>> 
>>> But, I believe that this is also a problem for ARM, though I have a hard time measuring
>>> because I can’t get pre-boot print statements in QEMU. I will attempt to take a look at the
>>> assembly, because I’m fairly certain that memcpy() isn’t inlined in master.
>>> 
>>> Even if we fix all the architectures, I would still like to merge the LZ4 patch. It seems like it
>>> is pretty easy to merge a patch that is a boot speed regression, because people aren’t
>>> actively measuring it. So I prefer a layered defense.
>> 
>> 
>> Layered defense against performance-only problem, happening on
>> emulation-only?
>> 
>> IMO that's a bit of overkill.
> 
> Why would it be emulation-only? QEMU is just being used for ease of
> testing, but the performance impact should be similar on bare metal.

In addition, I want the decompressors to be fast in the pre-boot for all
architectures. Not everyone is going to know that zstd and lz4 require
memcpy to be inlined or they are 10x slower, that is an implementation
detail of the library. It is a performance gotcha that I’d rather not have.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] lz4: Fix kernel decompression speed
  2020-08-04  1:56     ` Arvind Sankar
  2020-08-04  2:57       ` Nick Terrell
@ 2020-08-04 17:59       ` Nick Terrell
  2020-08-04 23:48         ` [PATCH 0/1] x86/boot/compressed: Use builtin mem functions for decompressor Arvind Sankar
  1 sibling, 1 reply; 14+ messages in thread
From: Nick Terrell @ 2020-08-04 17:59 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Nick Terrell, Ingo Molnar, Linux Kernel Mailing List, X86 ML,
	Kernel Team, Yann Collet, Gao Xiang, Sven Schmidt, Andrew Morton,
	Greg Kroah-Hartman, Linus Torvalds



> On Aug 3, 2020, at 6:56 PM, Arvind Sankar <nivedita@alum.mit.edu> wrote:
> 
> On Mon, Aug 03, 2020 at 10:55:01PM +0000, Nick Terrell wrote:
>> 
>> 
>>> On Aug 3, 2020, at 2:57 PM, Arvind Sankar <nivedita@alum.mit.edu> wrote:
>>> 
>>> On Mon, Aug 03, 2020 at 12:40:22PM -0700, Nick Terrell wrote:
>>>> From: Nick Terrell <terrelln@fb.com>
>>>> 
>>>> This patch replaces all memcpy() calls with LZ4_memcpy() which calls
>>>> __builtin_memcpy() so the compiler can inline it.
>>>> 
>>>> LZ4 relies heavily on memcpy() with a constant size being inlined. In
>>>> x86 and i386 pre-boot environments memcpy() cannot be inlined because
>>>> memcpy() doesn't get defined as __builtin_memcpy().
>>>> 
>>>> An equivalent patch has been applied upstream so that the next import
>>>> won't lose this change [1].
>>>> 
>>>> I've measured the kernel decompression speed using QEMU before and after
>>>> this patch for the x86_64 and i386 architectures. The speed-up is about
>>>> 10x as shown below.
>>>> 
>>>> Code	Arch	Kernel Size	Time	Speed
>>>> v5.8	x86_64	11504832 B	148 ms	 79 MB/s
>>>> patch	x86_64	11503872 B	 13 ms	885 MB/s
>>>> v5.8	i386	 9621216 B	 91 ms	106 MB/s
>>>> patch	i386	 9620224 B	 10 ms	962 MB/s
>>>> 
>>>> I also measured the time to decompress the initramfs on x86_64, i386,
>>>> and arm. All three show the same decompression speed before and after,
>>>> as expected.
>>>> 
>>>> [1] https://github.com/lz4/lz4/pull/890
>>>> 
>>> 
>>> Hi Nick, would you be able to test the below patch's performance to
>>> verify it gives the same speedup? It removes the #undef in misc.c which
>>> causes the decompressors to not use the builtin version. It should be
>>> equivalent to yours except for applying it to all the decompressors.
>>> 
>>> Thanks.
>> 
>> I will measure it. I would expect it to provide the same speed up. It would be great to fix
>> the problem for x86/i386 in general.
> 
> Thanks. I tried using RDTSC to get some timings under QEMU, and I get
> similar speedup as you have for LZ4, and around 15-20% or so for ZSTD
> (on 64-bit) -- I see that ZSTD_copy8 is already using __builtin_memcpy,
> but there must be more that can be optimized? There's a couple 1/2-byte
> sized copies in huf_decompress.c.

I tested your patch. I saw the same performance as my patch for LZ4, as expected.
I saw that zstd decompression time went from 84 ms -> 63 ms, a 25% improvement.

I will get started on a zstd patch. I look forward to your x86 patch getting merged!

Thanks,
Nick


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 0/1] x86/boot/compressed: Use builtin mem functions for decompressor
  2020-08-04 17:59       ` Nick Terrell
@ 2020-08-04 23:48         ` Arvind Sankar
  2020-08-04 23:48           ` [PATCH 1/1] " Arvind Sankar
  0 siblings, 1 reply; 14+ messages in thread
From: Arvind Sankar @ 2020-08-04 23:48 UTC (permalink / raw)
  To: Ingo Molnar, x86
  Cc: linux-kernel, Nick Terrell, Andrew Morton, Greg Kroah-Hartman,
	Linus Torvalds, Yann Collet, Gao Xiang, Sven Schmidt

This patch drops a couple of lines that prevent the decompressor code
from using __builtin_memcpy (and __builtin_memset).

Note that Nick Terrell's patch [0] will apply similar changes directly
in the decompressor code (in-kernel and upstream) for LZ4, and the ZSTD
code already is partially optimized to use __builtin_memcpy for hot
memcpy's, but still improves a little bit more with this patch.

[0] https://lore.kernel.org/lkml/20200803194022.2966806-1-nickrterrell@gmail.com/

Arvind Sankar (1):
  x86/boot/compressed: Use builtin mem functions for decompressor

 arch/x86/boot/compressed/misc.c | 7 ++-----
 arch/x86/boot/string.h          | 5 +----
 2 files changed, 3 insertions(+), 9 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/1] x86/boot/compressed: Use builtin mem functions for decompressor
  2020-08-04 23:48         ` [PATCH 0/1] x86/boot/compressed: Use builtin mem functions for decompressor Arvind Sankar
@ 2020-08-04 23:48           ` Arvind Sankar
  2020-08-19 18:14             ` Kees Cook
  0 siblings, 1 reply; 14+ messages in thread
From: Arvind Sankar @ 2020-08-04 23:48 UTC (permalink / raw)
  To: Ingo Molnar, x86
  Cc: linux-kernel, Nick Terrell, Andrew Morton, Greg Kroah-Hartman,
	Linus Torvalds, Yann Collet, Gao Xiang, Sven Schmidt

Since commits

  c041b5ad8640 ("x86, boot: Create a separate string.h file to provide standard string functions")
  fb4cac573ef6 ("x86, boot: Move memcmp() into string.h and string.c")

the decompressor stub has been using the compiler's builtin memcpy,
memset and memcmp functions, _except_ where it would likely have the
largest impact, in the decompression code itself.

Remove the #undef's of memcpy and memset in misc.c so that the
decompressor code also uses the compiler builtins.

The rationale given in the comment doesn't really apply: just because
some functions use the out-of-line version is no reason to not use the
builtin version in the rest.

Replace the comment with an explanation of why memzero and memmove are
being #define'd.

Drop the suggestion to #undef in boot/string.h as well: the out-of-line
versions are not really optimized versions, they're generic code that's
good enough for the preboot environment. The compiler will likely
generate better code for constant-size memcpy/memset/memcmp if it is
allowed to.

Most decompressors' performance is unchanged, with the exception of LZ4
and 64-bit ZSTD.

	Before	After ARCH
LZ4	  73ms	 10ms   32
LZ4	 120ms	 10ms	64
ZSTD	  90ms	 74ms	64

Measurements on QEMU on 2.2GHz Broadwell Xeon, using defconfig kernels.

Decompressor code size has small differences, with the largest being
that 64-bit ZSTD decreases just over 2k. The largest code size increase
was on 64-bit XZ, of about 400 bytes.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Suggested-by: Nick Terrell <nickrterrell@gmail.com>
Tested-by: Nick Terrell <nickrterrell@gmail.com>
---
 arch/x86/boot/compressed/misc.c | 7 ++-----
 arch/x86/boot/string.h          | 5 +----
 2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 39e592d0e0b4..e478e40fbe5a 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -30,12 +30,9 @@
 #define STATIC		static
 
 /*
- * Use normal definitions of mem*() from string.c. There are already
- * included header files which expect a definition of memset() and by
- * the time we define memset macro, it is too late.
+ * Provide definitions of memzero and memmove as some of the decompressors will
+ * try to define their own functions if these are not defined as macros.
  */
-#undef memcpy
-#undef memset
 #define memzero(s, n)	memset((s), 0, (n))
 #define memmove		memmove
 
diff --git a/arch/x86/boot/string.h b/arch/x86/boot/string.h
index 995f7b7ad512..a232da487cd2 100644
--- a/arch/x86/boot/string.h
+++ b/arch/x86/boot/string.h
@@ -11,10 +11,7 @@ void *memcpy(void *dst, const void *src, size_t len);
 void *memset(void *dst, int c, size_t len);
 int memcmp(const void *s1, const void *s2, size_t len);
 
-/*
- * Access builtin version by default. If one needs to use optimized version,
- * do "undef memcpy" in .c file and link against right string.c
- */
+/* Access builtin version by default. */
 #define memcpy(d,s,l) __builtin_memcpy(d,s,l)
 #define memset(d,c,l) __builtin_memset(d,c,l)
 #define memcmp	__builtin_memcmp
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/1] x86/boot/compressed: Use builtin mem functions for decompressor
  2020-08-04 23:48           ` [PATCH 1/1] " Arvind Sankar
@ 2020-08-19 18:14             ` Kees Cook
  2020-08-19 18:22               ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Kees Cook @ 2020-08-19 18:14 UTC (permalink / raw)
  To: Arvind Sankar, Ingo Molnar, Andrew Morton
  Cc: x86, linux-kernel, Nick Terrell, Greg Kroah-Hartman,
	Linus Torvalds, Yann Collet, Gao Xiang, Sven Schmidt

On Tue, Aug 04, 2020 at 07:48:17PM -0400, Arvind Sankar wrote:
> Since commits
> 
>   c041b5ad8640 ("x86, boot: Create a separate string.h file to provide standard string functions")
>   fb4cac573ef6 ("x86, boot: Move memcmp() into string.h and string.c")
> 
> the decompressor stub has been using the compiler's builtin memcpy,
> memset and memcmp functions, _except_ where it would likely have the
> largest impact, in the decompression code itself.
> 
> Remove the #undef's of memcpy and memset in misc.c so that the
> decompressor code also uses the compiler builtins.
> 
> The rationale given in the comment doesn't really apply: just because
> some functions use the out-of-line version is no reason to not use the
> builtin version in the rest.
> 
> Replace the comment with an explanation of why memzero and memmove are
> being #define'd.
> 
> Drop the suggestion to #undef in boot/string.h as well: the out-of-line
> versions are not really optimized versions, they're generic code that's
> good enough for the preboot environment. The compiler will likely
> generate better code for constant-size memcpy/memset/memcmp if it is
> allowed to.
> 
> Most decompressors' performance is unchanged, with the exception of LZ4
> and 64-bit ZSTD.
> 
> 	Before	After ARCH
> LZ4	  73ms	 10ms   32
> LZ4	 120ms	 10ms	64
> ZSTD	  90ms	 74ms	64
> 
> Measurements on QEMU on 2.2GHz Broadwell Xeon, using defconfig kernels.
> 
> Decompressor code size has small differences, with the largest being
> that 64-bit ZSTD decreases just over 2k. The largest code size increase
> was on 64-bit XZ, of about 400 bytes.
> 
> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
> Suggested-by: Nick Terrell <nickrterrell@gmail.com>
> Tested-by: Nick Terrell <nickrterrell@gmail.com>

Did anyone pick this up? (Ingo can you snag it, or maybe akpm who took
the LZ4-specific patch already?) This looks sane to me and provides some
surprising performance benefits. :)

Acked-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  arch/x86/boot/compressed/misc.c | 7 ++-----
>  arch/x86/boot/string.h          | 5 +----
>  2 files changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> index 39e592d0e0b4..e478e40fbe5a 100644
> --- a/arch/x86/boot/compressed/misc.c
> +++ b/arch/x86/boot/compressed/misc.c
> @@ -30,12 +30,9 @@
>  #define STATIC		static
>  
>  /*
> - * Use normal definitions of mem*() from string.c. There are already
> - * included header files which expect a definition of memset() and by
> - * the time we define memset macro, it is too late.
> + * Provide definitions of memzero and memmove as some of the decompressors will
> + * try to define their own functions if these are not defined as macros.
>   */
> -#undef memcpy
> -#undef memset
>  #define memzero(s, n)	memset((s), 0, (n))
>  #define memmove		memmove
>  
> diff --git a/arch/x86/boot/string.h b/arch/x86/boot/string.h
> index 995f7b7ad512..a232da487cd2 100644
> --- a/arch/x86/boot/string.h
> +++ b/arch/x86/boot/string.h
> @@ -11,10 +11,7 @@ void *memcpy(void *dst, const void *src, size_t len);
>  void *memset(void *dst, int c, size_t len);
>  int memcmp(const void *s1, const void *s2, size_t len);
>  
> -/*
> - * Access builtin version by default. If one needs to use optimized version,
> - * do "undef memcpy" in .c file and link against right string.c
> - */
> +/* Access builtin version by default. */
>  #define memcpy(d,s,l) __builtin_memcpy(d,s,l)
>  #define memset(d,c,l) __builtin_memset(d,c,l)
>  #define memcmp	__builtin_memcmp
> -- 
> 2.26.2
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/1] x86/boot/compressed: Use builtin mem functions for decompressor
  2020-08-19 18:14             ` Kees Cook
@ 2020-08-19 18:22               ` Linus Torvalds
  0 siblings, 0 replies; 14+ messages in thread
From: Linus Torvalds @ 2020-08-19 18:22 UTC (permalink / raw)
  To: Kees Cook
  Cc: Arvind Sankar, Ingo Molnar, Andrew Morton,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Nick Terrell, Greg Kroah-Hartman, Yann Collet, Gao Xiang,
	Sven Schmidt

On Wed, Aug 19, 2020 at 11:14 AM Kees Cook <keescook@chromium.org> wrote:
>
> Did anyone pick this up? (Ingo can you snag it, or maybe akpm who took
> the LZ4-specific patch already?) This looks sane to me and provides some
> surprising performance benefits. :)

I'll take this and Arvind's freestanding patch too.

             Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-08-19 18:22 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-03 19:40 [PATCH] lz4: Fix kernel decompression speed Nick Terrell
2020-08-03 21:57 ` Arvind Sankar
2020-08-03 22:55   ` Nick Terrell
2020-08-04  1:56     ` Arvind Sankar
2020-08-04  2:57       ` Nick Terrell
2020-08-04 15:19         ` Arvind Sankar
2020-08-04 17:59       ` Nick Terrell
2020-08-04 23:48         ` [PATCH 0/1] x86/boot/compressed: Use builtin mem functions for decompressor Arvind Sankar
2020-08-04 23:48           ` [PATCH 1/1] " Arvind Sankar
2020-08-19 18:14             ` Kees Cook
2020-08-19 18:22               ` Linus Torvalds
2020-08-04  8:32     ` [PATCH] lz4: Fix kernel decompression speed Pavel Machek
2020-08-04 15:16       ` Arvind Sankar
2020-08-04 17:18         ` Nick Terrell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).