* [GIT PULL] x86/misc for 6.5
@ 2023-06-27 11:00 Borislav Petkov
  2023-06-27 20:11 ` Linus Torvalds
  2023-06-27 20:52 ` pr-tracker-bot
  0 siblings, 2 replies; 16+ messages in thread
From: Borislav Petkov @ 2023-06-27 11:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: x86-ml, lkml

Hi Linus,

please pull the misc pile of updates for 6.5.

Thx.

---

The following changes since commit ac9a78681b921877518763ba0e89202254349d1b:

  Linux 6.4-rc1 (2023-05-07 13:34:35 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_misc_for_v6.5

for you to fetch changes up to 5516c89d58283413134f8d26960c6303d5d5bd89:

  x86/lib: Make get/put_user() exception handling a visible symbol (2023-06-02 10:51:46 +0200)

----------------------------------------------------------------
- Remove the local-symbols prefix from the get/put_user() exception
  handling symbols so that tools do not get confused by code that
  appears to belong to the wrong symbol, or to no symbol at all

- Improve csum_partial()'s performance

- Some improvements to the kcpuid tool

----------------------------------------------------------------
Borislav Petkov (AMD) (1):
      tools/x86/kcpuid: Dump the correct CPUID function in error

Nadav Amit (1):
      x86/lib: Make get/put_user() exception handling a visible symbol

Nathan Chancellor (1):
      x86/csum: Fix clang -Wuninitialized in csum_partial()

Noah Goldstein (1):
      x86/csum: Improve performance of `csum_partial`

Rong Tao (1):
      tools/x86/kcpuid: Add .gitignore

 arch/x86/lib/csum-partial_64.c   | 101 ++++++++----
 arch/x86/lib/getuser.S           |  32 ++--
 arch/x86/lib/putuser.S           |  24 +--
 lib/Kconfig.debug                |  17 ++
 lib/Makefile                     |   1 +
 lib/checksum_kunit.c             | 334 +++++++++++++++++++++++++++++++++++++++
 tools/arch/x86/kcpuid/.gitignore |   1 +
 tools/arch/x86/kcpuid/kcpuid.c   |   7 +-
 8 files changed, 453 insertions(+), 64 deletions(-)
 create mode 100644 lib/checksum_kunit.c
 create mode 100644 tools/arch/x86/kcpuid/.gitignore

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 11:00 [GIT PULL] x86/misc for 6.5 Borislav Petkov
@ 2023-06-27 20:11 ` Linus Torvalds
  2023-06-27 20:26   ` Linus Torvalds
  2023-06-27 21:44   ` Arjan van de Ven
  2023-06-27 20:52 ` pr-tracker-bot
  1 sibling, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2023-06-27 20:11 UTC (permalink / raw)
  To: Borislav Petkov, Noah Goldstein, Dave Hansen; +Cc: x86-ml, lkml

[-- Attachment #1: Type: text/plain, Size: 1371 bytes --]

On Tue, 27 Jun 2023 at 04:00, Borislav Petkov <bp@alien8.de> wrote:
>
> - Improve csum_partial()'s performance

Honestly, looking at that patch, my reaction is "why did it get
unrolled in 64-byte chunks, if 40 bytes is the magic value"?

Particularly when there is then that "do a carry op each 32 bytes to
make 32-byte chunks independent and increase ILP". So even the 64-byte
case isn't *actually* doing a 64-byte unrolling, it's really doing two
32-byte unrollings in parallel.

So you have three "magic" values, and the only one that really matters
is likely the 40-byte one.

Yes, yes, 64 bytes is the usual cacheline size, and is "traditional"
for unrolling. But there's nothing really magical about it here.

End result: wouldn't it have been nice to just do 40-byte chunks, and
make the 64-byte case (two parallel 32-byte chunks) be two of the
40-byte chunks instead?

Something like the (ENTIRELY UNTESTED!) attached patch?

Again: this is *not* tested. I took a quick look at the generated
assembly, and it looked roughly like what I expected it to look like,
but it may be complete garbage.

I added a couple of "likely()" things just because it made the
generated asm look more natural (ie it followed the order of the
source code there); they are otherwise questionable annotations.

Finally: did I already mention that this is completely untested?

             Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 2897 bytes --]

 arch/x86/lib/csum-partial_64.c | 66 ++++++++++++++++++------------------------
 1 file changed, 28 insertions(+), 38 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index cea25ca8b8cf..9f1fa47ccbe0 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -33,6 +33,20 @@ static inline __wsum csum_tail(u64 temp64, int odd)
 	return (__force __wsum)result;
 }
 
+static inline unsigned long update_csum_40b(unsigned long sum, const unsigned long m[5])
+{
+	asm volatile("addq %1,%0\n\t"
+		     "adcq %2,%0\n\t"
+		     "adcq %3,%0\n\t"
+		     "adcq %4,%0\n\t"
+		     "adcq %5,%0\n\t"
+		     "adcq $0,%0"
+			:"=r" (sum)
+			:"m" (m[0]), "m" (m[1]), "m" (m[2]),
+			 "m" (m[3]), "m" (m[4]), "0" (sum));
+	return sum;
+}
+
 /*
  * Do a checksum on an arbitrary memory area.
  * Returns a 32bit checksum.
@@ -64,47 +78,23 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
 	 * has noticeable negative affect on codegen for all other cases with
 	 * minimal performance benefit here.
 	 */
-	if (len == 40) {
-		asm("addq 0*8(%[src]),%[res]\n\t"
-		    "adcq 1*8(%[src]),%[res]\n\t"
-		    "adcq 2*8(%[src]),%[res]\n\t"
-		    "adcq 3*8(%[src]),%[res]\n\t"
-		    "adcq 4*8(%[src]),%[res]\n\t"
-		    "adcq $0,%[res]"
-		    : [res] "+r"(temp64)
-		    : [src] "r"(buff), "m"(*(const char(*)[40])buff));
+	if (likely(len == 40)) {
+		temp64 = update_csum_40b(temp64, buff);
 		return csum_tail(temp64, odd);
 	}
-	if (unlikely(len >= 64)) {
-		/*
-		 * Extra accumulators for better ILP in the loop.
-		 */
-		u64 tmp_accum, tmp_carries;
 
-		asm("xorl %k[tmp_accum],%k[tmp_accum]\n\t"
-		    "xorl %k[tmp_carries],%k[tmp_carries]\n\t"
-		    "subl $64, %[len]\n\t"
-		    "1:\n\t"
-		    "addq 0*8(%[src]),%[res]\n\t"
-		    "adcq 1*8(%[src]),%[res]\n\t"
-		    "adcq 2*8(%[src]),%[res]\n\t"
-		    "adcq 3*8(%[src]),%[res]\n\t"
-		    "adcl $0,%k[tmp_carries]\n\t"
-		    "addq 4*8(%[src]),%[tmp_accum]\n\t"
-		    "adcq 5*8(%[src]),%[tmp_accum]\n\t"
-		    "adcq 6*8(%[src]),%[tmp_accum]\n\t"
-		    "adcq 7*8(%[src]),%[tmp_accum]\n\t"
-		    "adcl $0,%k[tmp_carries]\n\t"
-		    "addq $64, %[src]\n\t"
-		    "subl $64, %[len]\n\t"
-		    "jge 1b\n\t"
-		    "addq %[tmp_accum],%[res]\n\t"
-		    "adcq %[tmp_carries],%[res]\n\t"
-		    "adcq $0,%[res]"
-		    : [tmp_accum] "=&r"(tmp_accum),
-		      [tmp_carries] "=&r"(tmp_carries), [res] "+r"(temp64),
-		      [len] "+r"(len), [src] "+r"(buff)
-		    : "m"(*(const char *)buff));
+	/* Do two 40-byte chunks in parallel to get better ILP */
+	if (likely(len >= 80)) {
+		u64 temp64_2 = 0;
+		do {
+			temp64 = update_csum_40b(temp64, buff);
+			temp64_2 = update_csum_40b(temp64_2, buff + 40);
+			buff += 80;
+			len -= 80;
+		} while (len >= 80);
+		asm("addq %1,%0\n\t"
+		    "adcq $0,%0"
+		    :"=r" (temp64): "r" (temp64_2));
 	}
 
 	if (len & 32) {


* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 20:11 ` Linus Torvalds
@ 2023-06-27 20:26   ` Linus Torvalds
  2023-06-27 20:38     ` Borislav Petkov
  2023-06-27 21:44   ` Arjan van de Ven
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2023-06-27 20:26 UTC (permalink / raw)
  To: Borislav Petkov, Noah Goldstein, Dave Hansen; +Cc: x86-ml, lkml

On Tue, 27 Jun 2023 at 13:11, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Finally: did I already mention that this is completely untested?

Oh, this part is buggy:

+               asm("addq %1,%0\n\t"
+                   "adcq $0,%0"
+                   :"=r" (temp64): "r" (temp64_2));

and it needs to show that 'temp64' is an input too.

Dummy me.

The trivial fix is just to make the "=r" be a "+r".

In fact, I should have used "+r" inside update_csum_40b(), but at
least there I did add the proper input constraint, so that one isn't
actively buggy.
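
For illustration, a minimal sketch of the difference (hypothetical
helper, not code from the thread):

static inline unsigned long sum_one_buggy(unsigned long sum, unsigned long x)
{
        /* BUG: "=r" makes 'sum' write-only, so the register the asm
         * uses may never be loaded with the old value of 'sum'. */
        asm("addq %1,%0" : "=r" (sum) : "r" (x));
        return sum;
}

static inline unsigned long sum_one(unsigned long sum, unsigned long x)
{
        /* OK: "+r" marks 'sum' as both input and output. A matching
         * "0" input constraint on a "=r" output would also work. */
        asm("addq %1,%0" : "+r" (sum) : "r" (x));
        return sum;
}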

And again: I noticed this by looking at the patch one more time. No
actual *testing* has happened. It might still be buggy garbage even
with that "+r". It's just a bit *less* buggy garbage.

I will now go back to my cave and continue pulling stuff, I just had
to do something else for a while. Some people relax with a nice drink
by the pool, I relax by playing around with inline asm.

                Linus


* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 20:26   ` Linus Torvalds
@ 2023-06-27 20:38     ` Borislav Petkov
  2023-06-27 20:49       ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2023-06-27 20:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Noah Goldstein, Dave Hansen, x86-ml, lkml

On Tue, Jun 27, 2023 at 01:26:18PM -0700, Linus Torvalds wrote:
> I will now go back to my cave and continue pulling stuff, I just had
> to do something else for a while. Some people relax with a nice drink
> by the pool, I relax by playing around with inline asm.

And there's a third kind who relax by the pool with a nice drink,
*while* playing around with inline asm. ;-P

Btw, I'll send you a new version of this pull request with this patch
dropped to let folks experiment with it more.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 20:38     ` Borislav Petkov
@ 2023-06-27 20:49       ` Linus Torvalds
  2023-06-27 21:11         ` Borislav Petkov
  2023-06-27 21:44         ` Linus Torvalds
  0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2023-06-27 20:49 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Noah Goldstein, Dave Hansen, x86-ml, lkml

[-- Attachment #1: Type: text/plain, Size: 1436 bytes --]

On Tue, 27 Jun 2023 at 13:38, Borislav Petkov <bp@alien8.de> wrote:
>
> And there's a third kind who relax by the pool with a nice drink,
> *while* playing around with inline asm. ;-P

That explains a lot.

> Btw, I'll send you a new version of this pull request with this patch
> dropped to let folks experiment with it more.

Oh, I already merged it. I don't hate the change, I just looked at it
and went "I would have done that differently" and started playing
around with it.

There's nothing hugely *wrong* with the code I merged, but I do think
that it did too much inside the inline asm (ie looping inside the asm,
but also initializing values that could have - and should have - just
been given as inputs to the asm).
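
For illustration, hypothetical helpers (not the merged code) showing
that second point:

static unsigned long add_opaque(const unsigned long *p)
{
        unsigned long acc;

        /* Zeroing the accumulator inside the asm: the compiler cannot
         * see the initialization, so it cannot optimize around it. */
        asm("xorl %k0,%k0\n\t"
            "addq %1,%0"
            : "=&r" (acc)
            : "m" (*p));
        return acc;
}

static unsigned long add_visible(const unsigned long *p)
{
        /* Same computation, but the zero is a plain C initializer
         * that the compiler can see, schedule, and feed in. */
        unsigned long acc = 0;

        asm("addq %1,%0" : "+r" (acc) : "m" (*p));
        return acc;
}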

And then there's the whole question of "why have two different versions
for 40-byte and 64-byte areas, when you _could_ just do it with one
40-byte version that you then also just unroll".

So I _think_ my version is nicer and shorter - assuming it works and
there are no other bugs than the one I already noticed - but I don't
think it's a huge deal.

Anyway, before I throw my patch away, I'll just post it with the
trivial fixes to use "+r", and with the "volatile" removed (I add
"volatile" to asms by habit, but this one really isn't volatile).

I just checked that both gcc and clang seem to be happy with it, but
that's the only testing this patch has gotten: it compiles for me.

Do with it what you will.

                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 2869 bytes --]

 arch/x86/lib/csum-partial_64.c | 66 ++++++++++++++++++------------------------
 1 file changed, 28 insertions(+), 38 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index cea25ca8b8cf..61e8c3d97c04 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -33,6 +33,20 @@ static inline __wsum csum_tail(u64 temp64, int odd)
 	return (__force __wsum)result;
 }
 
+static inline unsigned long update_csum_40b(unsigned long sum, const unsigned long m[5])
+{
+	asm("addq %1,%0\n\t"
+	     "adcq %2,%0\n\t"
+	     "adcq %3,%0\n\t"
+	     "adcq %4,%0\n\t"
+	     "adcq %5,%0\n\t"
+	     "adcq $0,%0"
+		:"+r" (sum)
+		:"m" (m[0]), "m" (m[1]), "m" (m[2]),
+		 "m" (m[3]), "m" (m[4]));
+	return sum;
+}
+
 /*
  * Do a checksum on an arbitrary memory area.
  * Returns a 32bit checksum.
@@ -64,47 +78,23 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
 	 * has noticeable negative affect on codegen for all other cases with
 	 * minimal performance benefit here.
 	 */
-	if (len == 40) {
-		asm("addq 0*8(%[src]),%[res]\n\t"
-		    "adcq 1*8(%[src]),%[res]\n\t"
-		    "adcq 2*8(%[src]),%[res]\n\t"
-		    "adcq 3*8(%[src]),%[res]\n\t"
-		    "adcq 4*8(%[src]),%[res]\n\t"
-		    "adcq $0,%[res]"
-		    : [res] "+r"(temp64)
-		    : [src] "r"(buff), "m"(*(const char(*)[40])buff));
+	if (likely(len == 40)) {
+		temp64 = update_csum_40b(temp64, buff);
 		return csum_tail(temp64, odd);
 	}
-	if (unlikely(len >= 64)) {
-		/*
-		 * Extra accumulators for better ILP in the loop.
-		 */
-		u64 tmp_accum, tmp_carries;
 
-		asm("xorl %k[tmp_accum],%k[tmp_accum]\n\t"
-		    "xorl %k[tmp_carries],%k[tmp_carries]\n\t"
-		    "subl $64, %[len]\n\t"
-		    "1:\n\t"
-		    "addq 0*8(%[src]),%[res]\n\t"
-		    "adcq 1*8(%[src]),%[res]\n\t"
-		    "adcq 2*8(%[src]),%[res]\n\t"
-		    "adcq 3*8(%[src]),%[res]\n\t"
-		    "adcl $0,%k[tmp_carries]\n\t"
-		    "addq 4*8(%[src]),%[tmp_accum]\n\t"
-		    "adcq 5*8(%[src]),%[tmp_accum]\n\t"
-		    "adcq 6*8(%[src]),%[tmp_accum]\n\t"
-		    "adcq 7*8(%[src]),%[tmp_accum]\n\t"
-		    "adcl $0,%k[tmp_carries]\n\t"
-		    "addq $64, %[src]\n\t"
-		    "subl $64, %[len]\n\t"
-		    "jge 1b\n\t"
-		    "addq %[tmp_accum],%[res]\n\t"
-		    "adcq %[tmp_carries],%[res]\n\t"
-		    "adcq $0,%[res]"
-		    : [tmp_accum] "=&r"(tmp_accum),
-		      [tmp_carries] "=&r"(tmp_carries), [res] "+r"(temp64),
-		      [len] "+r"(len), [src] "+r"(buff)
-		    : "m"(*(const char *)buff));
+	/* Do two 40-byte chunks in parallel to get better ILP */
+	if (likely(len >= 80)) {
+		u64 temp64_2 = 0;
+		do {
+			temp64 = update_csum_40b(temp64, buff);
+			temp64_2 = update_csum_40b(temp64_2, buff + 40);
+			buff += 80;
+			len -= 80;
+		} while (len >= 80);
+		asm("addq %1,%0\n\t"
+		    "adcq $0,%0"
+		    :"+r" (temp64): "r" (temp64_2));
 	}
 
 	if (len & 32) {

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 11:00 [GIT PULL] x86/misc for 6.5 Borislav Petkov
  2023-06-27 20:11 ` Linus Torvalds
@ 2023-06-27 20:52 ` pr-tracker-bot
  1 sibling, 0 replies; 16+ messages in thread
From: pr-tracker-bot @ 2023-06-27 20:52 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Linus Torvalds, x86-ml, lkml

The pull request you sent on Tue, 27 Jun 2023 13:00:38 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_misc_for_v6.5

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/4baa098a147d76a9ad1a6867fa14286db52085b6

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 20:49       ` Linus Torvalds
@ 2023-06-27 21:11         ` Borislav Petkov
  2023-06-27 21:44         ` Linus Torvalds
  1 sibling, 0 replies; 16+ messages in thread
From: Borislav Petkov @ 2023-06-27 21:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Noah Goldstein, Dave Hansen, x86-ml, lkml

On Tue, Jun 27, 2023 at 01:49:12PM -0700, Linus Torvalds wrote:
> That explains a lot.

LOL!

That activity has one rule: don't send the code on the same day as the
pool visit. :-)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 20:11 ` Linus Torvalds
  2023-06-27 20:26   ` Linus Torvalds
@ 2023-06-27 21:44   ` Arjan van de Ven
  2023-06-27 22:25     ` Linus Torvalds
  1 sibling, 1 reply; 16+ messages in thread
From: Arjan van de Ven @ 2023-06-27 21:44 UTC (permalink / raw)
  To: Linus Torvalds, Borislav Petkov, Noah Goldstein, Dave Hansen; +Cc: x86-ml, lkml

On 6/27/2023 1:11 PM, Linus Torvalds wrote:
> On Tue, 27 Jun 2023 at 04:00, Borislav Petkov <bp@alien8.de> wrote:
>>
>> - Improve csum_partial()'s performance
> 
> Honestly, looking at that patch, my reaction is "why did it get
> unrolled in 64-byte chunks, if 40 bytes is the magic value"?
> 
> Particularly when there is then that "do a carry op each 32 bytes to
> make 32-byte chunks independent and increase ILP". So even the 64-byte
> case isn't *actually* doing a 64-byte unrolling, it's really doing two
> 32-byte unrollings in parallel.
> 
> So you have three "magic" values, and the only one that really matters
> is likely the 40-byte one.
> 
> Yes, yes, 64 bytes is the usual cacheline size, and is "traditional"
> for unrolling. But there's nothing really magical about it here.
> 
> End result: wouldn't it have been nice to just do 40-byte chunks, and
> make the 64-byte case (two parallel 32-byte chunks) be two of the
> 40-byte chunks instead?
> 
> Something like the (ENTIRELY UNTESTED!) attached patch?
> 
> Again: this is *not* tested. I took a quick look at the generated
> assembly, and it looked roughly like what I expected it to look like,
> but it may be complete garbage.
> 
> I added a couple of "likely()" things just because it made the
> generated asm look more natural (ie it followed the order of the
> source code there); they are otherwise questionable annotations.
> 
> Finally: did I already mention that this is completely untested?

fwiw, long flights serve the same purpose as pools; I made a userspace testbench
for this some time ago: https://github.com/fenrus75/csum_partial
in case one would actually WANT to test ;)




* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 20:49       ` Linus Torvalds
  2023-06-27 21:11         ` Borislav Petkov
@ 2023-06-27 21:44         ` Linus Torvalds
  2023-06-27 21:58           ` Linus Torvalds
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2023-06-27 21:44 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Noah Goldstein, Dave Hansen, x86-ml, lkml

[-- Attachment #1: Type: text/plain, Size: 831 bytes --]

On Tue, 27 Jun 2023 at 13:49, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, before I throw my patch away, I'll just post it with the
> trivial fixes to use "+r", and with the "volatile" removed (I add
> "volatile" to asms by habit, but this one really isn't volatile).

Oh, never mind. I was about to throw it away, and then I realized that
the code *after* the loop relied on the range having been reduced down
to below 64 bytes, and checked for 32/16/8/4 byte ranges.

And my change to make it loop over 80 bytes had made that no longer be true.

But now I'm committed, and decided to fix that too, and just
re-organize the code to get all the cases right.
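
To make the bug concrete, here is a tiny standalone sketch (assuming
the tail works like the kernel's 32/16/8/4 bit-test cascade, with the
last 0-3 bytes handled elsewhere):

#include <assert.h>

/* Bytes consumed by a 32/16/8/4 tail cascade. */
static int tail_bytes(int len)
{
        return (len & 32) + (len & 16) + (len & 8) + (len & 4);
}

int main(void)
{
        assert(tail_bytes(60) == 60);   /* any len < 64 is fully covered */
        assert(tail_bytes(72) == 8);    /* 64 bytes would be silently skipped */
        return 0;
}

After an 80-byte loop the remaining len can be anywhere up to 79, so
the second case is reachable; the re-organized patch below consumes a
remaining 40-byte chunk first, which guarantees len < 40 before the
cascade runs.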

And now I'm going to actually boot-test the end result too.  Because
life is too short to spend all my time _just_ with merging.

               Linus

[-- Attachment #2: 0001-Silly-csum-improvement.-Maybe.patch --]
[-- Type: text/x-patch, Size: 3423 bytes --]

From 7ac1b9952815d04d1906cc6dd01bcf742d6a6893 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 27 Jun 2023 13:55:32 -0700
Subject: [PATCH] Silly csum improvement. Maybe.

---
 arch/x86/lib/csum-partial_64.c | 83 ++++++++++++++++------------------
 1 file changed, 38 insertions(+), 45 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index cea25ca8b8cf..d96e1da6604a 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -33,6 +33,20 @@ static inline __wsum csum_tail(u64 temp64, int odd)
 	return (__force __wsum)result;
 }
 
+static inline unsigned long update_csum_40b(unsigned long sum, const unsigned long m[5])
+{
+	asm("addq %1,%0\n\t"
+	     "adcq %2,%0\n\t"
+	     "adcq %3,%0\n\t"
+	     "adcq %4,%0\n\t"
+	     "adcq %5,%0\n\t"
+	     "adcq $0,%0"
+		:"+r" (sum)
+		:"m" (m[0]), "m" (m[1]), "m" (m[2]),
+		 "m" (m[3]), "m" (m[4]));
+	return sum;
+}
+
 /*
  * Do a checksum on an arbitrary memory area.
  * Returns a 32bit checksum.
@@ -59,52 +73,31 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
 		buff++;
 	}
 
-	/*
-	 * len == 40 is the hot case due to IPv6 headers, but annotating it likely()
-	 * has noticeable negative affect on codegen for all other cases with
-	 * minimal performance benefit here.
-	 */
-	if (len == 40) {
-		asm("addq 0*8(%[src]),%[res]\n\t"
-		    "adcq 1*8(%[src]),%[res]\n\t"
-		    "adcq 2*8(%[src]),%[res]\n\t"
-		    "adcq 3*8(%[src]),%[res]\n\t"
-		    "adcq 4*8(%[src]),%[res]\n\t"
-		    "adcq $0,%[res]"
-		    : [res] "+r"(temp64)
-		    : [src] "r"(buff), "m"(*(const char(*)[40])buff));
-		return csum_tail(temp64, odd);
-	}
-	if (unlikely(len >= 64)) {
-		/*
-		 * Extra accumulators for better ILP in the loop.
-		 */
-		u64 tmp_accum, tmp_carries;
+	/* Do two 40-byte chunks in parallel to get better ILP */
+	if (likely(len >= 80)) {
+		u64 temp64_2 = 0;
+		do {
+			temp64 = update_csum_40b(temp64, buff);
+			temp64_2 = update_csum_40b(temp64_2, buff + 40);
+			buff += 80;
+			len -= 80;
+		} while (len >= 80);
 
-		asm("xorl %k[tmp_accum],%k[tmp_accum]\n\t"
-		    "xorl %k[tmp_carries],%k[tmp_carries]\n\t"
-		    "subl $64, %[len]\n\t"
-		    "1:\n\t"
-		    "addq 0*8(%[src]),%[res]\n\t"
-		    "adcq 1*8(%[src]),%[res]\n\t"
-		    "adcq 2*8(%[src]),%[res]\n\t"
-		    "adcq 3*8(%[src]),%[res]\n\t"
-		    "adcl $0,%k[tmp_carries]\n\t"
-		    "addq 4*8(%[src]),%[tmp_accum]\n\t"
-		    "adcq 5*8(%[src]),%[tmp_accum]\n\t"
-		    "adcq 6*8(%[src]),%[tmp_accum]\n\t"
-		    "adcq 7*8(%[src]),%[tmp_accum]\n\t"
-		    "adcl $0,%k[tmp_carries]\n\t"
-		    "addq $64, %[src]\n\t"
-		    "subl $64, %[len]\n\t"
-		    "jge 1b\n\t"
-		    "addq %[tmp_accum],%[res]\n\t"
-		    "adcq %[tmp_carries],%[res]\n\t"
-		    "adcq $0,%[res]"
-		    : [tmp_accum] "=&r"(tmp_accum),
-		      [tmp_carries] "=&r"(tmp_carries), [res] "+r"(temp64),
-		      [len] "+r"(len), [src] "+r"(buff)
-		    : "m"(*(const char *)buff));
+		asm("addq %1,%0\n\t"
+		    "adcq $0,%0"
+		    :"+r" (temp64): "r" (temp64_2));
+	}
+
+	/*
+	 * len == 40 is the hot case due to IPv6 headers, so return
+	 * early for that exact case without checking the tail bytes.
+	 */
+	if (len >= 40) {
+		temp64 = update_csum_40b(temp64, buff);
+		len -= 40;
+		if (!len)
+			return csum_tail(temp64, odd);
+		buff += 40;
 	}
 
 	if (len & 32) {
-- 
2.41.0.203.ga4f2cd32bb.dirty



* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 21:44         ` Linus Torvalds
@ 2023-06-27 21:58           ` Linus Torvalds
  0 siblings, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2023-06-27 21:58 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Noah Goldstein, Dave Hansen, x86-ml, lkml

On Tue, 27 Jun 2023 at 14:44, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But now I'm committed, and decided to fix that too, and just
> re-organize the code to get all the cases right.
>
> And now I'm going to actually boot-test the end result too.  Because
> life is too short to spend all my time _just_ with merging.

Well, it boots. And I clearly have networking. But who knows how much
that is actually using the csum_partial() function? Not me. I'm just
along for the ride.

Anyway, that last version handles the 40-byte special case
differently, in that it might have done some arbitrary number of
80-byte chunks first. But it shouldn't really make a difference - it
does check for >= 80 bytes first, but we're talking two extra
instructions.

And that way the end case is always less than 64 bytes, and so the
tests for 32/16/8 work fine.

And now it's committed to my test tree, so I'm not throwing it away,
but I also won't be working on it any more. If somebody wants to time
it using Arjan's little thing, more power to them.

               Linus


* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 21:44   ` Arjan van de Ven
@ 2023-06-27 22:25     ` Linus Torvalds
  2023-06-27 22:43       ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2023-06-27 22:25 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Borislav Petkov, Noah Goldstein, Dave Hansen, x86-ml, lkml

On Tue, 27 Jun 2023 at 14:44, Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> fwiw long flights and pools have a relation; I made a userspace testbench
> for this some time ago: https://github.com/fenrus75/csum_partial
> in case one would actually WANT to test ;)

Hmm.

I don't know what the rules are - and some of the functions you test
seem actively buggy (ie not handling alignment right etc).

But on my machine I get:

  02:   8.6 /  10.4 cycles (e29e455e) Upcoming linux kernel version
  04:   8.6 /  10.4 cycles (e29e455e) Specialized to size 40
  06:   7.7 /   9.5 cycles (e29e455e) New version
  22:   8.7 /   9.6 cycles (e29e455e) Odd-alignment handling removed
 ...

which would seem to mean that my code ("New version") is doing well.

It does do worse on the "odd alignment" case:

  03: 15.5 /  17.8 cycles (00006580) Upcoming linux kernel version
  05: 15.5 /  17.8 cycles (00006580) Specialized to size 40
  07: 16.6 /  19.5 cycles (0000bc29) New version
  23:   8.8 /   8.6 cycles (1de29e47) Odd-alignment handling removed
 ...

I just hacked the code into the benchmark without looking too closely
at what is going on, so no guarantees.

         Linus


* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 22:25     ` Linus Torvalds
@ 2023-06-27 22:43       ` Linus Torvalds
  2023-06-27 22:50         ` Arjan van de Ven
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2023-06-27 22:43 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Borislav Petkov, Noah Goldstein, Dave Hansen, x86-ml, lkml

On Tue, 27 Jun 2023 at 15:25, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I don't know what the rules are - and some of the functions you test
> seem actively buggy (ie not handling alignment right etc).

Oh. And you *only* test the 40-byte case. That seems a bit bogus.

If I change the packet size from 40 to 1500, I get

  02: 185.1 / 186.4 cycles (8b414316) Upcoming linux kernel version
  04: 184.9 / 186.5 cycles (8b414316) Specialized to size 40
  06: 107.3 / 117.2 cycles (8b414316) New version
  22: 185.6 / 186.5 cycles (8b414316) Odd-alignment handling removed

which seems unexpectedly bad for the other versions.

But those other functions have that 64-byte unrolling, rather than the
"two 40-byte loops", so maybe it is real, and my version is actually
just that good.

Or maybe it's a sign that my version is some seriously buggy crap, and
it just looks good on the benchmark because it does the wrong thing.

Whatever. Back to the merge window again.

              Linus


* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 22:43       ` Linus Torvalds
@ 2023-06-27 22:50         ` Arjan van de Ven
  2023-06-27 23:02           ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Arjan van de Ven @ 2023-06-27 22:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Borislav Petkov, Noah Goldstein, Dave Hansen, x86-ml, lkml

On 6/27/2023 3:43 PM, Linus Torvalds wrote:
> On Tue, 27 Jun 2023 at 15:25, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> I don't know what the rules are - and some of the functions you test
>> seem actively buggy (ie not handling alignment right etc).
> 
> Oh. And you *only* test the 40-byte case. That seems a bit bogus.
> 
> If I change the packet size from 40 to 1500, I get
> 
>    02: 185.1 / 186.4 cycles (8b414316) Upcoming linux kernel version
>    04: 184.9 / 186.5 cycles (8b414316) Specialized to size 40
>    06: 107.3 / 117.2 cycles (8b414316) New version
>    22: 185.6 / 186.5 cycles (8b414316) Odd-alignment handling removed
> 
> which seems unexpectedly bad for the other versions.
> 

I'm not surprised though; running 2 parallel streams (where one stream
has a fixed zero as input, so it can run out-of-order at any time) can
really make a performance difference like this.



* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 22:50         ` Arjan van de Ven
@ 2023-06-27 23:02           ` Linus Torvalds
  2023-06-27 23:05             ` Arjan van de Ven
  2023-06-29  5:29             ` Herbert Xu
  0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2023-06-27 23:02 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Borislav Petkov, Noah Goldstein, Dave Hansen, x86-ml, lkml

On Tue, 27 Jun 2023 at 15:51, Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> I'm not surprised though; running 2 parallel streams (where one stream
> has a fixed zero as input, so it can run out-of-order at any time) can
> really make a performance difference like this.

How much do people care?

One of the advantages of just having that single "update_csum_40b()"
function is that it's trivial to then manually unroll.

With a 4-way unrolling, I get

  02: 184.0 / 184.5 cycles (8b414316) Upcoming linux kernel version
  04: 184.0 / 184.2 cycles (8b414316) Specialized to size 40
  06: 89.4 / 102.5 cycles (512daed6) New version
  22: 184.6 / 184.4 cycles (8b414316) Odd-alignment handling removed
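
For reference, a sketch of what that 4-way unroll of the hot loop
might look like in the same csum_partial() context as the patch (the
accumulator names are made up, and the sub-160-byte tail is omitted):

        if (len >= 160) {
                u64 t1 = 0, t2 = 0, t3 = 0;
                do {
                        temp64 = update_csum_40b(temp64, buff);
                        t1 = update_csum_40b(t1, buff + 40);
                        t2 = update_csum_40b(t2, buff + 80);
                        t3 = update_csum_40b(t3, buff + 120);
                        buff += 160;
                        len -= 160;
                } while (len >= 160);
                /* Fold the extra accumulators back with end-around carry. */
                asm("addq %1,%0\n\t"
                    "adcq %2,%0\n\t"
                    "adcq %3,%0\n\t"
                    "adcq $0,%0"
                    : "+r" (temp64)
                    : "r" (t1), "r" (t2), "r" (t3));
        }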

but doesn't most network hardware do the csum on its own anyway? How
critical is csum_partial(), really?

(The above is obviously still your test thing, modified for 1500-byte
packets. With 40-byte packets, the 4-way unrolling obviously doesn't
help, although it doesn't noticeably hurt either - it's just one more
compare and branch.)

                  Linus


* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 23:02           ` Linus Torvalds
@ 2023-06-27 23:05             ` Arjan van de Ven
  2023-06-29  5:29             ` Herbert Xu
  1 sibling, 0 replies; 16+ messages in thread
From: Arjan van de Ven @ 2023-06-27 23:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Borislav Petkov, Noah Goldstein, Dave Hansen, x86-ml, lkml

On 6/27/2023 4:02 PM, Linus Torvalds wrote:
> On Tue, 27 Jun 2023 at 15:51, Arjan van de Ven <arjan@linux.intel.com> wrote:
>>
>> I'm not surprised though; running 2 parallel streams (where one stream
>> has a fixed zero as input, so it can run out-of-order at any time) can
>> really make a performance difference like this.
> 
> How much do people care?
> 
> One of the advantages of just having that single "update_csum_40b()"
> function is that it's trivial to then manually unroll.
> 
> With a 4-way unrolling, I get
> 
>    02: 184.0 / 184.5 cycles (8b414316) Upcoming linux kernel version
>    04: 184.0 / 184.2 cycles (8b414316) Specialized to size 40
>    06: 89.4 / 102.5 cycles (512daed6) New version
>    22: 184.6 / 184.4 cycles (8b414316) Odd-alignment handling removed
> 
> but doesn't most network hardware do the csum on its own anyway? How
> critical is csum_partial(), really?

the hardware does most cases;
in
https://lore.kernel.org/netdev/20211111181025.2139131-1-eric.dumazet@gmail.com/
Eric kind of implies it's for IPv6 headers in practice




* Re: [GIT PULL] x86/misc for 6.5
  2023-06-27 23:02           ` Linus Torvalds
  2023-06-27 23:05             ` Arjan van de Ven
@ 2023-06-29  5:29             ` Herbert Xu
  1 sibling, 0 replies; 16+ messages in thread
From: Herbert Xu @ 2023-06-29  5:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: arjan, bp, goldstein.w.n, dave.hansen, x86, linux-kernel

Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> but doesn't most network hardware do the csum on its own anyway? How
> critical is csum_partial(), really?

That hardware will checksum the bulk of the packet, but we still
need to checksum the header bits (e.g., 40 bytes or less) when we
add and remove headers in software.

The one exception is Realtek drivers, which seem to come with
checksum offload disabled by default.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

