* [PATCH] x86/lib: Remove the special case for odd-aligned buffers in csum_partial.c
@ 2021-12-13 14:43 David Laight
  2021-12-13 15:02 ` Dave Hansen
  0 siblings, 1 reply; 5+ messages in thread
From: David Laight @ 2021-12-13 14:43 UTC (permalink / raw)
  To: David Laight, 'Noah Goldstein', Eric Dumazet
  Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz,
	alexanderduyck, open list, netdev

There is no need to special case the very unusual odd-aligned buffers.
They are no worse than 4n+2 aligned buffers.

Signed-off-by: David Laight <david.laight@aculab.com>
---

On an i7-7700, misaligned buffers add 2 or 3 clocks (out of 115) to a
512-byte checksum.
That is just measuring the main loop with an lfence prior to rdpmc to
read PERF_COUNT_HW_CPU_CYCLES.

 arch/x86/lib/csum-partial_64.c | 16 +---------------
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 40b527ba1da1..abf819dd8525 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -35,17 +35,7 @@ static inline unsigned short from32to16(unsigned a)
 __wsum csum_partial(const void *buff, int len, __wsum sum)
 {
 	u64 temp64 = (__force u64)sum;
-	unsigned odd, result;
-
-	odd = 1 & (unsigned long) buff;
-	if (unlikely(odd)) {
-		if (unlikely(len == 0))
-			return sum;
-		temp64 = ror32((__force u32)sum, 8);
-		temp64 += (*(unsigned char *)buff << 8);
-		len--;
-		buff++;
-	}
+	unsigned result;
 
 	while (unlikely(len >= 64)) {
 		asm("addq 0*8(%[src]),%[res]\n\t"
@@ -130,10 +120,6 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
 #endif
 	}
 	result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
-	if (unlikely(odd)) { 
-		result = from32to16(result);
-		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
-	}
 	return (__force __wsum)result;
 }
 EXPORT_SYMBOL(csum_partial);
-- 
2.17.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH] x86/lib: Remove the special case for odd-aligned buffers in csum_partial.c
  2021-12-13 14:43 [PATCH] x86/lib: Remove the special case for odd-aligned buffers in csum_partial.c David Laight
@ 2021-12-13 15:02 ` Dave Hansen
  2021-12-13 15:37   ` David Laight
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Hansen @ 2021-12-13 15:02 UTC (permalink / raw)
  To: David Laight, 'Noah Goldstein', Eric Dumazet
  Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz,
	alexanderduyck, open list, netdev

On 12/13/21 6:43 AM, David Laight wrote:
> There is no need to special case the very unusual odd-aligned buffers.
> They are no worse than 4n+2 aligned buffers.
> 
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
> 
> On an i7-7700, misaligned buffers add 2 or 3 clocks (out of 115) to a
> 512-byte checksum.
> That is just measuring the main loop with an lfence prior to rdpmc to
> read PERF_COUNT_HW_CPU_CYCLES.

I'm a bit confused by this changelog.

Are you saying that the patch causes a (small) performance regression?

Are you also saying that the optimization here is not worth it because
it saves 15 lines of code?  Or that the misalignment checks themselves
add 2 or 3 cycles, and this is an *optimization*?


* RE: [PATCH] x86/lib: Remove the special case for odd-aligned buffers in csum_partial.c
  2021-12-13 15:02 ` Dave Hansen
@ 2021-12-13 15:37   ` David Laight
  2021-12-13 15:56     ` Eric Dumazet
  0 siblings, 1 reply; 5+ messages in thread
From: David Laight @ 2021-12-13 15:37 UTC (permalink / raw)
  To: 'Dave Hansen', 'Noah Goldstein', Eric Dumazet
  Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz,
	alexanderduyck, open list, netdev

From: Dave Hansen
> Sent: 13 December 2021 15:02
> 
> On 12/13/21 6:43 AM, David Laight wrote:
> > There is no need to special case the very unusual odd-aligned buffers.
> > They are no worse than 4n+2 aligned buffers.
> >
> > Signed-off-by: David Laight <david.laight@aculab.com>
> > ---
> >
> > On an i7-7700, misaligned buffers add 2 or 3 clocks (out of 115) to a
> > 512-byte checksum.
> > That is just measuring the main loop with an lfence prior to rdpmc to
> > read PERF_COUNT_HW_CPU_CYCLES.
> 
> I'm a bit confused by this changelog.
> 
> Are you saying that the patch causes a (small) performance regression?
> 
> Are you also saying that the optimization here is not worth it because
> it saves 15 lines of code?  Or that the misalignment checks themselves
> add 2 or 3 cycles, and this is an *optimization*?

I'm saying that it can't be worth optimising for a misaligned
buffer because the cost of the buffer being misaligned is so small.
So the test for a misaligned buffer is going to cost more than
any plausible gain.

Not only that, the buffer will never be odd-aligned at all.

The code is left over from a previous version that did aligned
word reads - so it had to do extra work for odd alignment.

Note that the code is doing misaligned reads for the more likely
4n+2 aligned ethernet receive buffers.
I doubt that even a test for that would be worthwhile, even if you
were checksumming full-sized ethernet packets.

So the change is deleting code that is never actually executed
from the hot path.

	David



* Re: [PATCH] x86/lib: Remove the special case for odd-aligned buffers in csum_partial.c
  2021-12-13 15:37   ` David Laight
@ 2021-12-13 15:56     ` Eric Dumazet
  2021-12-13 16:16       ` David Laight
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Dumazet @ 2021-12-13 15:56 UTC (permalink / raw)
  To: David Laight
  Cc: Dave Hansen, Noah Goldstein, tglx, mingo, Borislav Petkov,
	dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list,
	netdev

On Mon, Dec 13, 2021 at 7:37 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Dave Hansen
> > Sent: 13 December 2021 15:02
> >
> > On 12/13/21 6:43 AM, David Laight wrote:
> > > There is no need to special case the very unusual odd-aligned buffers.
> > > They are no worse than 4n+2 aligned buffers.
> > >
> > > Signed-off-by: David Laight <david.laight@aculab.com>
> > > ---
> > >
> > > On an i7-7700, misaligned buffers add 2 or 3 clocks (out of 115) to a
> > > 512-byte checksum.
> > > That is just measuring the main loop with an lfence prior to rdpmc to
> > > read PERF_COUNT_HW_CPU_CYCLES.
> >
> > I'm a bit confused by this changelog.
> >
> > Are you saying that the patch causes a (small) performance regression?
> >
> > Are you also saying that the optimization here is not worth it because
> > it saves 15 lines of code?  Or that the misalignment checks themselves
> > add 2 or 3 cycles, and this is an *optimization*?
>
> I'm saying that it can't be worth optimising for a misaligned
> buffer because the cost of the buffer being misaligned is so small.
> So the test for a misaligned buffer is going to cost more than
> any plausible gain.
>
> Not only that, the buffer will never be odd-aligned at all.
>
> The code is left over from a previous version that did aligned
> word reads - so it had to do extra work for odd alignment.
>
> Note that the code is doing misaligned reads for the more likely
> 4n+2 aligned ethernet receive buffers.
> I doubt that even a test for that would be worthwhile, even if you
> were checksumming full-sized ethernet packets.
>
> So the change is deleting code that is never actually executed
> from the hot path.
>

I think I left this code in because I got confused by the odd/even
case, but that is handled by upper-level functions like
csum_block_add().

What matters is not whether the start of a frag is odd/even, but
at what offset it sits in the overall 'frame', if a frame is split
into multiple areas (scatter/gather).

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !


* RE: [PATCH] x86/lib: Remove the special case for odd-aligned buffers in csum_partial.c
  2021-12-13 15:56     ` Eric Dumazet
@ 2021-12-13 16:16       ` David Laight
  0 siblings, 0 replies; 5+ messages in thread
From: David Laight @ 2021-12-13 16:16 UTC (permalink / raw)
  To: 'Eric Dumazet'
  Cc: Dave Hansen, Noah Goldstein, tglx, mingo, Borislav Petkov,
	dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list,
	netdev

From: Eric Dumazet
> Sent: 13 December 2021 15:56
> 
> On Mon, Dec 13, 2021 at 7:37 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Dave Hansen
> > > Sent: 13 December 2021 15:02
> > >
> > > On 12/13/21 6:43 AM, David Laight wrote:
> > > > There is no need to special case the very unusual odd-aligned buffers.
> > > > They are no worse than 4n+2 aligned buffers.
> > > >
> > > > Signed-off-by: David Laight <david.laight@aculab.com>
> > > > ---
> > > >
> > > > On an i7-7700, misaligned buffers add 2 or 3 clocks (out of 115) to a
> > > > 512-byte checksum.
> > > > That is just measuring the main loop with an lfence prior to rdpmc to
> > > > read PERF_COUNT_HW_CPU_CYCLES.
> > >
> > > I'm a bit confused by this changelog.
> > >
> > > Are you saying that the patch causes a (small) performance regression?
> > >
> > > Are you also saying that the optimization here is not worth it because
> > > it saves 15 lines of code?  Or that the misalignment checks themselves
> > > add 2 or 3 cycles, and this is an *optimization*?
> >
> > I'm saying that it can't be worth optimising for a misaligned
> > buffer because the cost of the buffer being misaligned is so small.
> > So the test for a misaligned buffer is going to cost more than
> > any plausible gain.
> >
> > Not only that, the buffer will never be odd-aligned at all.
> >
> > The code is left over from a previous version that did aligned
> > word reads - so it had to do extra work for odd alignment.
> >
> > Note that the code is doing misaligned reads for the more likely
> > 4n+2 aligned ethernet receive buffers.
> > I doubt that even a test for that would be worthwhile, even if you
> > were checksumming full-sized ethernet packets.
> >
> > So the change is deleting code that is never actually executed
> > from the hot path.
> >
> 
> I think I left this code in because I got confused by the odd/even
> case, but that is handled by upper-level functions like
> csum_block_add().
>
> What matters is not whether the start of a frag is odd/even, but
> at what offset it sits in the overall 'frame', if a frame is split
> into multiple areas (scatter/gather).

Yes, odd-length fragments are a different problem.

	David


