linux-kernel.vger.kernel.org archive mirror
* [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.
@ 2021-12-13 18:00 David Laight
  2021-12-13 18:40 ` Alexander Duyck
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: David Laight @ 2021-12-13 18:00 UTC (permalink / raw)
  To: 'Noah Goldstein', 'Eric Dumazet'
  Cc: 'tglx@linutronix.de', 'mingo@redhat.com',
	'Borislav Petkov', 'dave.hansen@linux.intel.com',
	'X86 ML', 'hpa@zytor.com',
	'peterz@infradead.org', 'alexanderduyck@fb.com',
	'open list', 'netdev'


Add in the trailing bytes first so that there is no need to worry
about the sum exceeding 64 bits.

Signed-off-by: David Laight <david.laight@aculab.com>
---

This ought to be faster - because of all the removed 'adc $0'.
Guessing how fast x86 code will run is hard!
There are other ways of handling buffers that are shorter than 8 bytes,
but I'd rather hope they don't happen in any hot paths.

Note - I've not even compile tested it.
(But have tested an equivalent change before.)
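
For what it's worth, a quick user-space check of the bound this relies
on (just an illustration, not part of the patch; the names are made up):

	#include <assert.h>
	#include <stdint.h>

	int main(void)
	{
		/* Incoming 32-bit wsum at its maximum value. */
		uint64_t sum32_max = 0xffffffffull;
		/* At most 7 trailing bytes get folded in, so below 2^56. */
		uint64_t trail_max = (1ull << 56) - 1;

		/*
		 * The plain 64-bit add of the trailing bytes cannot carry
		 * out of 64 bits, so it needs no 'adc $0'; the main loop
		 * still folds its own carries with adc.
		 */
		assert(sum32_max <= UINT64_MAX - trail_max);
		return 0;
	}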

 arch/x86/lib/csum-partial_64.c | 55 ++++++++++++----------------------
 1 file changed, 19 insertions(+), 36 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index abf819dd8525..fbcc073fc2b5 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -37,6 +37,24 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
 	u64 temp64 = (__force u64)sum;
 	unsigned result;
 
+	if (len & 7) {
+		if (unlikely(len < 8)) {
+			/* Avoid falling off the start of the buffer */
+			if (len & 4) {
+				temp64 += *(u32 *)buff;
+				buff += 4;
+			}
+			if (len & 2) {
+				temp64 += *(u16 *)buff;
+				buff += 2;
+			}
+			if (len & 1)
+				temp64 += *(u8 *)buff;
+			goto reduce_to32;
+		}
+		temp64 += *(u64 *)(buff + len - 8) << (8 - (len & 7)) * 8;
+	}
+
 	while (unlikely(len >= 64)) {
 		asm("addq 0*8(%[src]),%[res]\n\t"
 		    "adcq 1*8(%[src]),%[res]\n\t"
@@ -82,43 +100,8 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
 			: "memory");
 		buff += 8;
 	}
-	if (len & 7) {
-#ifdef CONFIG_DCACHE_WORD_ACCESS
-		unsigned int shift = (8 - (len & 7)) * 8;
-		unsigned long trail;
-
-		trail = (load_unaligned_zeropad(buff) << shift) >> shift;
 
-		asm("addq %[trail],%[res]\n\t"
-		    "adcq $0,%[res]"
-			: [res] "+r" (temp64)
-			: [trail] "r" (trail));
-#else
-		if (len & 4) {
-			asm("addq %[val],%[res]\n\t"
-			    "adcq $0,%[res]"
-				: [res] "+r" (temp64)
-				: [val] "r" ((u64)*(u32 *)buff)
-				: "memory");
-			buff += 4;
-		}
-		if (len & 2) {
-			asm("addq %[val],%[res]\n\t"
-			    "adcq $0,%[res]"
-				: [res] "+r" (temp64)
-				: [val] "r" ((u64)*(u16 *)buff)
-				: "memory");
-			buff += 2;
-		}
-		if (len & 1) {
-			asm("addq %[val],%[res]\n\t"
-			    "adcq $0,%[res]"
-				: [res] "+r" (temp64)
-				: [val] "r" ((u64)*(u8 *)buff)
-				: "memory");
-		}
-#endif
-	}
+reduce_to32:
 	result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
 	return (__force __wsum)result;
 }
-- 
2.17.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* RE: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.
  2021-12-13 18:00 [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes David Laight
@ 2021-12-13 18:40 ` Alexander Duyck
  2021-12-13 22:52   ` David Laight
  2021-12-13 18:45 ` Eric Dumazet
  2021-12-14 12:36 ` David Laight
  2 siblings, 1 reply; 6+ messages in thread
From: Alexander Duyck @ 2021-12-13 18:40 UTC (permalink / raw)
  To: David Laight, 'Noah Goldstein', 'Eric Dumazet'
  Cc: 'tglx@linutronix.de', 'mingo@redhat.com',
	'Borislav Petkov', 'dave.hansen@linux.intel.com',
	'X86 ML', 'hpa@zytor.com',
	'peterz@infradead.org', 'open list',
	'netdev'

> -----Original Message-----
> From: David Laight <David.Laight@ACULAB.COM>
> Sent: Monday, December 13, 2021 10:01 AM
> To: 'Noah Goldstein' <goldstein.w.n@gmail.com>; 'Eric Dumazet'
> <edumazet@google.com>
> Cc: 'tglx@linutronix.de' <tglx@linutronix.de>; 'mingo@redhat.com'
> <mingo@redhat.com>; 'Borislav Petkov' <bp@alien8.de>;
> 'dave.hansen@linux.intel.com' <dave.hansen@linux.intel.com>; 'X86 ML'
> <x86@kernel.org>; 'hpa@zytor.com' <hpa@zytor.com>;
> 'peterz@infradead.org' <peterz@infradead.org>; Alexander Duyck
> <alexanderduyck@fb.com>; 'open list' <linux-kernel@vger.kernel.org>;
> 'netdev' <netdev@vger.kernel.org>
> Subject: [PATCH] lib/x86: Optimise csum_partial of buffers that are not
> multiples of 8 bytes.
> 
> 
> Add in the trailing bytes first so that there is no need to worry about the sum
> exceeding 64 bits.
> 
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
> 
> This ought to be faster - because of all the removed 'adc $0'.
> Guessing how fast x86 code will run is hard!
> There are other ways of handing buffers that are shorter than 8 bytes, but I'd
> rather hope they don't happen in any hot paths.
> 
> Note - I've not even compile tested it.
> (But have tested an equivalent change before.)
> 
>  arch/x86/lib/csum-partial_64.c | 55 ++++++++++++----------------------
>  1 file changed, 19 insertions(+), 36 deletions(-)
> 
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> index abf819dd8525..fbcc073fc2b5 100644
> --- a/arch/x86/lib/csum-partial_64.c
> +++ b/arch/x86/lib/csum-partial_64.c
> @@ -37,6 +37,24 @@ __wsum csum_partial(const void *buff, int len,
> __wsum sum)
>  	u64 temp64 = (__force u64)sum;
>  	unsigned result;
> 
> +	if (len & 7) {
> +		if (unlikely(len < 8)) {
> +			/* Avoid falling off the start of the buffer */
> +			if (len & 4) {
> +				temp64 += *(u32 *)buff;
> +				buff += 4;
> +			}
> +			if (len & 2) {
> +				temp64 += *(u16 *)buff;
> +				buff += 2;
> +			}
> +			if (len & 1)
> +				temp64 += *(u8 *)buff;
> +			goto reduce_to32;
> +		}
> +		temp64 += *(u64 *)(buff + len - 8) << (8 - (len & 7)) * 8;
> +	}
> +

I don't think your shift is headed in the right direction. If your starting offset is "buff + len - 8" then your remaining bits should be in the upper bytes of the qword, not the lower bytes, shouldn't they? So I would think it should be ">>", not "<<".
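
Just to illustrate (stand-alone user-space sketch, not patch code): on
little-endian the trailing bytes land in the high bytes of the qword
read at "buff + len - 8", so it is the low, already-summed bytes that
need discarding:

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		/* 11-byte buffer: 8 bytes for the main loop, 3 trailing. */
		uint8_t buff[11] = { 0, 0, 0, 0, 0, 0, 0, 0, 0xaa, 0xbb, 0xcc };
		int len = 11;
		uint64_t q;

		memcpy(&q, buff + len - 8, 8);		/* last 8 bytes */
		q >>= (8 - (len & 7)) * 8;		/* drop the 5 low bytes */
		printf("%#llx\n", (unsigned long long)q);	/* prints 0xccbbaa */
		return 0;
	}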

>  	while (unlikely(len >= 64)) {
>  		asm("addq 0*8(%[src]),%[res]\n\t"
>  		    "adcq 1*8(%[src]),%[res]\n\t"
> @@ -82,43 +100,8 @@ __wsum csum_partial(const void *buff, int len,
> __wsum sum)
>  			: "memory");
>  		buff += 8;
>  	}
> -	if (len & 7) {
> -#ifdef CONFIG_DCACHE_WORD_ACCESS
> -		unsigned int shift = (8 - (len & 7)) * 8;
> -		unsigned long trail;
> -
> -		trail = (load_unaligned_zeropad(buff) << shift) >> shift;

Your code above should be equivalent to the load_unaligned_zeropad() << shift, so the shift you are performing above is equivalent to the later one.

> 
> -		asm("addq %[trail],%[res]\n\t"
> -		    "adcq $0,%[res]"
> -			: [res] "+r" (temp64)
> -			: [trail] "r" (trail));
> -#else
> -		if (len & 4) {
> -			asm("addq %[val],%[res]\n\t"
> -			    "adcq $0,%[res]"
> -				: [res] "+r" (temp64)
> -				: [val] "r" ((u64)*(u32 *)buff)
> -				: "memory");
> -			buff += 4;
> -		}
> -		if (len & 2) {
> -			asm("addq %[val],%[res]\n\t"
> -			    "adcq $0,%[res]"
> -				: [res] "+r" (temp64)
> -				: [val] "r" ((u64)*(u16 *)buff)
> -				: "memory");
> -			buff += 2;
> -		}
> -		if (len & 1) {
> -			asm("addq %[val],%[res]\n\t"
> -			    "adcq $0,%[res]"
> -				: [res] "+r" (temp64)
> -				: [val] "r" ((u64)*(u8 *)buff)
> -				: "memory");
> -		}
> -#endif
> -	}
> +reduce_to32:
>  	result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
>  	return (__force __wsum)result;
>  }
> --
> 2.17.1
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes,
> MK1 1PT, UK Registration No: 1397386 (Wales)


* Re: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.
  2021-12-13 18:00 [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes David Laight
  2021-12-13 18:40 ` Alexander Duyck
@ 2021-12-13 18:45 ` Eric Dumazet
  2021-12-13 19:23   ` Alexander Duyck
  2021-12-14 12:36 ` David Laight
  2 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2021-12-13 18:45 UTC (permalink / raw)
  To: David Laight
  Cc: Noah Goldstein, tglx, mingo, Borislav Petkov, dave.hansen,
	X86 ML, hpa, peterz, alexanderduyck, open list, netdev

On Mon, Dec 13, 2021 at 10:00 AM David Laight <David.Laight@aculab.com> wrote:
>
>
> Add in the trailing bytes first so that there is no need to worry
> about the sum exceeding 64 bits.
>
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
>
> This ought to be faster - because of all the removed 'adc $0'.
> Guessing how fast x86 code will run is hard!
> There are other ways of handing buffers that are shorter than 8 bytes,
> but I'd rather hope they don't happen in any hot paths.
>
> Note - I've not even compile tested it.
> (But have tested an equivalent change before.)
>
>  arch/x86/lib/csum-partial_64.c | 55 ++++++++++++----------------------
>  1 file changed, 19 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> index abf819dd8525..fbcc073fc2b5 100644
> --- a/arch/x86/lib/csum-partial_64.c
> +++ b/arch/x86/lib/csum-partial_64.c
> @@ -37,6 +37,24 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
>         u64 temp64 = (__force u64)sum;
>         unsigned result;
>
> +       if (len & 7) {
> +               if (unlikely(len < 8)) {
> +                       /* Avoid falling off the start of the buffer */
> +                       if (len & 4) {
> +                               temp64 += *(u32 *)buff;
> +                               buff += 4;
> +                       }
> +                       if (len & 2) {
> +                               temp64 += *(u16 *)buff;
> +                               buff += 2;
> +                       }
> +                       if (len & 1)
> +                               temp64 += *(u8 *)buff;
> +                       goto reduce_to32;
> +               }
> +               temp64 += *(u64 *)(buff + len - 8) << (8 - (len & 7)) * 8;

This is reading far away (end of buffer).

Maybe instead read the first bytes and adjust @buff, to allow for
better hardware prefetching?



> +       }
> +
>         while (unlikely(len >= 64)) {
>                 asm("addq 0*8(%[src]),%[res]\n\t"
>                     "adcq 1*8(%[src]),%[res]\n\t"
> @@ -82,43 +100,8 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
>                         : "memory");
>                 buff += 8;
>         }
> -       if (len & 7) {
> -#ifdef CONFIG_DCACHE_WORD_ACCESS
> -               unsigned int shift = (8 - (len & 7)) * 8;
> -               unsigned long trail;
> -
> -               trail = (load_unaligned_zeropad(buff) << shift) >> shift;
>
> -               asm("addq %[trail],%[res]\n\t"
> -                   "adcq $0,%[res]"
> -                       : [res] "+r" (temp64)
> -                       : [trail] "r" (trail));
> -#else
> -               if (len & 4) {
> -                       asm("addq %[val],%[res]\n\t"
> -                           "adcq $0,%[res]"
> -                               : [res] "+r" (temp64)
> -                               : [val] "r" ((u64)*(u32 *)buff)
> -                               : "memory");
> -                       buff += 4;
> -               }
> -               if (len & 2) {
> -                       asm("addq %[val],%[res]\n\t"
> -                           "adcq $0,%[res]"
> -                               : [res] "+r" (temp64)
> -                               : [val] "r" ((u64)*(u16 *)buff)
> -                               : "memory");
> -                       buff += 2;
> -               }
> -               if (len & 1) {
> -                       asm("addq %[val],%[res]\n\t"
> -                           "adcq $0,%[res]"
> -                               : [res] "+r" (temp64)
> -                               : [val] "r" ((u64)*(u8 *)buff)
> -                               : "memory");
> -               }
> -#endif
> -       }
> +reduce_to32:
>         result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
>         return (__force __wsum)result;
>  }
> --
> 2.17.1
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)


* RE: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.
  2021-12-13 18:45 ` Eric Dumazet
@ 2021-12-13 19:23   ` Alexander Duyck
  0 siblings, 0 replies; 6+ messages in thread
From: Alexander Duyck @ 2021-12-13 19:23 UTC (permalink / raw)
  To: Eric Dumazet, David Laight
  Cc: Noah Goldstein, tglx, mingo, Borislav Petkov, dave.hansen,
	X86 ML, hpa, peterz, open list, netdev



> -----Original Message-----
> From: Eric Dumazet <edumazet@google.com>
> Sent: Monday, December 13, 2021 10:45 AM
> To: David Laight <David.Laight@aculab.com>
> Cc: Noah Goldstein <goldstein.w.n@gmail.com>; tglx@linutronix.de;
> mingo@redhat.com; Borislav Petkov <bp@alien8.de>;
> dave.hansen@linux.intel.com; X86 ML <x86@kernel.org>; hpa@zytor.com;
> peterz@infradead.org; Alexander Duyck <alexanderduyck@fb.com>; open
> list <linux-kernel@vger.kernel.org>; netdev <netdev@vger.kernel.org>
> Subject: Re: [PATCH] lib/x86: Optimise csum_partial of buffers that are not
> multiples of 8 bytes.
> 
> On Mon, Dec 13, 2021 at 10:00 AM David Laight <David.Laight@aculab.com>
> wrote:
> >
> >
> > Add in the trailing bytes first so that there is no need to worry
> > about the sum exceeding 64 bits.
> >
> > Signed-off-by: David Laight <david.laight@aculab.com>
> > ---
> >
> > This ought to be faster - because of all the removed 'adc $0'.
> > Guessing how fast x86 code will run is hard!
> > There are other ways of handing buffers that are shorter than 8 bytes,
> > but I'd rather hope they don't happen in any hot paths.
> >
> > Note - I've not even compile tested it.
> > (But have tested an equivalent change before.)
> >
> >  arch/x86/lib/csum-partial_64.c | 55
> > ++++++++++++----------------------
> >  1 file changed, 19 insertions(+), 36 deletions(-)
> >
> > diff --git a/arch/x86/lib/csum-partial_64.c
> > b/arch/x86/lib/csum-partial_64.c index abf819dd8525..fbcc073fc2b5
> > 100644
> > --- a/arch/x86/lib/csum-partial_64.c
> > +++ b/arch/x86/lib/csum-partial_64.c
> > @@ -37,6 +37,24 @@ __wsum csum_partial(const void *buff, int len,
> __wsum sum)
> >         u64 temp64 = (__force u64)sum;
> >         unsigned result;
> >
> > +       if (len & 7) {
> > +               if (unlikely(len < 8)) {
> > +                       /* Avoid falling off the start of the buffer */
> > +                       if (len & 4) {
> > +                               temp64 += *(u32 *)buff;
> > +                               buff += 4;
> > +                       }
> > +                       if (len & 2) {
> > +                               temp64 += *(u16 *)buff;
> > +                               buff += 2;
> > +                       }
> > +                       if (len & 1)
> > +                               temp64 += *(u8 *)buff;
> > +                       goto reduce_to32;
> > +               }
> > +               temp64 += *(u64 *)(buff + len - 8) << (8 - (len & 7))
> > + * 8;
> 
> This is reading far away (end of buffer).
> 
> Maybe instead read the first bytes and adjust @buff, to allow for better
> hardware prefetching ?

That will cause the wsum to be aligned to the length instead of the buff, wouldn't it? So we would need an extra rotation at the end to realign odd-length sections, wouldn't we?

Since our only concern here would be large buffers, would it maybe make sense to just run the loop in the call in reverse if we make this change? That way we would still be accessing the same cache line once we start the loop. For smaller buffers I would imagine the overhead should be minimal, since we would likely still have the end of the buffer in cache; it would only be something like 40B over anyway.
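
Something along these lines at the end, I mean (rough sketch only;
'odd' is a hypothetical flag for having consumed an odd number of
leading bytes, and the byte swap is the same idea as the odd-address
fix-up in lib/checksum.c's do_csum()):

	/* Realign a 16-bit folded sum when the data was folded in at an
	 * odd byte offset.
	 */
	static unsigned int csum_realign(unsigned int result16, int odd)
	{
		if (odd)
			result16 = ((result16 >> 8) & 0xff) |
				   ((result16 & 0xff) << 8);
		return result16;
	}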


* RE: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.
  2021-12-13 18:40 ` Alexander Duyck
@ 2021-12-13 22:52   ` David Laight
  0 siblings, 0 replies; 6+ messages in thread
From: David Laight @ 2021-12-13 22:52 UTC (permalink / raw)
  To: 'Alexander Duyck', 'Noah Goldstein',
	'Eric Dumazet'
  Cc: 'tglx@linutronix.de', 'mingo@redhat.com',
	'Borislav Petkov', 'dave.hansen@linux.intel.com',
	'X86 ML', 'hpa@zytor.com',
	'peterz@infradead.org', 'open list',
	'netdev'

From: Alexander Duyck <alexanderduyck@fb.com>
> Sent: 13 December 2021 18:40
...
> > Add in the trailing bytes first so that there is no need to worry about the sum
> > exceeding 64 bits.
> >
> > Signed-off-by: David Laight <david.laight@aculab.com>
> > ---
> >
> > This ought to be faster - because of all the removed 'adc $0'.
> > Guessing how fast x86 code will run is hard!
> > There are other ways of handing buffers that are shorter than 8 bytes, but I'd
> > rather hope they don't happen in any hot paths.
> >
> > Note - I've not even compile tested it.
> > (But have tested an equivalent change before.)
> >
> >  arch/x86/lib/csum-partial_64.c | 55 ++++++++++++----------------------
> >  1 file changed, 19 insertions(+), 36 deletions(-)
> >
> > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > index abf819dd8525..fbcc073fc2b5 100644
> > --- a/arch/x86/lib/csum-partial_64.c
> > +++ b/arch/x86/lib/csum-partial_64.c
> > @@ -37,6 +37,24 @@ __wsum csum_partial(const void *buff, int len,
> > __wsum sum)
> >  	u64 temp64 = (__force u64)sum;
> >  	unsigned result;
> >
> > +	if (len & 7) {
> > +		if (unlikely(len < 8)) {
> > +			/* Avoid falling off the start of the buffer */
> > +			if (len & 4) {
> > +				temp64 += *(u32 *)buff;
> > +				buff += 4;
> > +			}
> > +			if (len & 2) {
> > +				temp64 += *(u16 *)buff;
> > +				buff += 2;
> > +			}
> > +			if (len & 1)
> > +				temp64 += *(u8 *)buff;
> > +			goto reduce_to32;
> > +		}
> > +		temp64 += *(u64 *)(buff + len - 8) << (8 - (len & 7)) * 8;
> > +	}
> > +
> 
> I don't think your shift is headed in the right direction. If your starting offset is "buff + len - 8"
> then your remaining bits should be in the upper bytes of the qword, not the lower bytes shouldn't
> they? So I would think it should be ">>" not "<<".

Brain-fart :-)
It needs to discard the low bytes - so >> is indeed right.
I did say I hadn't tested it.

Cache line wise I'm not sure whether it matters.
If the data is in the cache it doesn't matter.
If the data isn't in the cache then the only real problem is if the
line gets evicted - only likely for 4k-ish+ buffers.
I'd guess the largest checksum is under 1500 bytes - hardware doing
TSO will be doing hardware checksums. So evictions are unlikely.

Plausibly the *(buff + len - 8) read could be done after the while() loop.
That would need an adc and a saved copy of the length (or a read that would trap),
but would only be loading the 'next' cache line.

So you'd end up with something like:
		while (len >= 64) {
			...
		}
		if (len & 7)
			trail = *(u64 *)(buff + len - 8) >> (8 - (len & 7)) * 8;
		if (len & 32)
			...
		if (len & 16)
			...
		if (len & 8)
			...
		temp64 += trail
		adc $0, temp64
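
i.e. with the last two lines spelled out using the same asm pattern as
the current trailing-bytes code (still untested; 'trail' must be
initialised to 0 for the multiple-of-8 case):

		u64 trail = 0;
		...
		if (len & 7)
			trail = *(u64 *)(buff + len - 8) >> (8 - (len & 7)) * 8;
		/* len & 32 / 16 / 8 blocks as before */
		asm("addq %[trail],%[res]\n\t"
		    "adcq $0,%[res]"
			: [res] "+r" (temp64)
			: [trail] "r" (trail));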

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* RE: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.
  2021-12-13 18:00 [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes David Laight
  2021-12-13 18:40 ` Alexander Duyck
  2021-12-13 18:45 ` Eric Dumazet
@ 2021-12-14 12:36 ` David Laight
  2 siblings, 0 replies; 6+ messages in thread
From: David Laight @ 2021-12-14 12:36 UTC (permalink / raw)
  To: David Laight, 'Noah Goldstein', 'Eric Dumazet'
  Cc: 'tglx@linutronix.de', 'mingo@redhat.com',
	'Borislav Petkov', 'dave.hansen@linux.intel.com',
	'X86 ML', 'hpa@zytor.com',
	'peterz@infradead.org', 'alexanderduyck@fb.com',
	'open list', 'netdev'

From: David Laight <David.Laight@ACULAB.COM>
> Sent: 13 December 2021 18:01
> 
> Add in the trailing bytes first so that there is no need to worry
> about the sum exceeding 64 bits.

This is an alternate version that (mostly) compiles to reasonable code.
I've also booted a kernel with it - networking still works!

https://godbolt.org/z/K6vY31Gqs

I changed the while (len >= 64) loop into an
if (len >= 64) do {...} while (len >= 64) one.
But gcc makes a pig's breakfast of compiling it - it optimises
it so that it is while (ptr < lim) but adds a lot of code.
So I've done that by hand.
Then it still makes a meal of it because it refuses to take
'buff' from the final loop iteration.
An assignment to the limit helps.

Then there is the calculation of (8 - (len & 7)) * 8.
gcc prior to 9.2 just negates (len & 7) and then uses leal 56(,%rsi,8),%rcx.
But later ones fail to notice.
Even given (64 + 8 * -(len & 7)), clang fails to use leal.

I'm not even sure the code clang generates is right:
(%rsi is (len & 7))
        movq    -8(%rsi,%rax), %rdx
        leal    (,%rsi,8), %ecx
        andb    $56, %cl
        negb    %cl
        shrq    %cl, %rdx

The 'negb' is the wrong size for the 'andb'.
It might be ok if it is assuming the cpu ignores the high 2 bits of %cl.
But that is a horrid assumption to be making.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

