* [PATCH ] x86/lib: Optimise copy loop for long buffers in csum-partial_64.c
@ 2022-01-06 16:19 David Laight
From: David Laight @ 2022-01-06 16:19 UTC (permalink / raw)
To: 'Eric Dumazet', 'Peter Zijlstra'
Cc: 'tglx@linutronix.de', 'mingo@redhat.com',
'Borislav Petkov', 'dave.hansen@linux.intel.com',
'X86 ML', 'hpa@zytor.com',
'alexanderduyck@fb.com', 'open list',
'netdev', 'Noah Goldstein'
gcc converts the loop into one that only increments the pointer,
but it makes a mess of calculating the limit, and gcc 9.1+ completely
refuses to use the final value of 'buff' from the last iteration.
Explicitly code a pointer comparison and don't bother changing len;
the code after the loop only tests individual bits of it ('len & 32').
Signed-off-by: David Laight <david.laight@aculab.com>
---
The asm("" : "+r" (buff)); forces gcc to use the loop-updated
value of 'buff' and removes at least 6 instructions.
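For readers unfamiliar with the idiom, here is a minimal user-space sketch
of the empty asm statement. hide_var() and sum_words() are hypothetical
names used only for illustration, and the sketch assumes a compiler with
GNU-style inline asm (gcc or clang). The statement emits no instructions;
the "+r" constraint only tells the compiler that 'buff' is read and possibly
modified in a register, so it has to carry on from that register value
instead of re-deriving the pointer some other way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: an empty asm that claims
 * to read and write 'p', so the compiler must keep using this value. */
#define hide_var(p) __asm__("" : "+r" (p))

/* Sum 'len' bytes as 64-bit words; len must be a non-zero multiple of 8. */
uint64_t sum_words(const unsigned char *buff, size_t len)
{
	const unsigned char *lim = buff + len;
	uint64_t sum = 0;

	assert(len >= 8 && len % 8 == 0);
	do {
		uint64_t w;

		memcpy(&w, buff, sizeof(w));	/* alignment-safe load */
		sum += w;
		hide_var(buff);			/* same placement as in the patch */
		buff += 8;
	} while (buff < lim);

	return sum;
}
```

The barrier does not change the result. Whether it saves any instructions
depends on the compiler version, so compare the generated code before
relying on it.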
The gcc folk really ought to look at why gcc 9.1 onwards is so
much worse than gcc 8.
See https://godbolt.org/z/T39PcnvfE
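The loop change described in the commit message can also be sketched in
portable C, which makes it easy to check that the two forms agree. This is
an illustration under stated assumptions, not the kernel code: add_carry(),
sum_block(), csum_blocks_count() and csum_blocks_limit() are hypothetical
names, and the addq/adcq chain is replaced by a plain C add with end-around
carry.

```c
#include <stdint.h>
#include <string.h>

/* Add one 64-bit word with end-around carry (stand-in for adcq). */
static uint64_t add_carry(uint64_t sum, uint64_t w)
{
	sum += w;
	return sum + (sum < w);	/* fold the carry back in */
}

/* Fold one 64-byte block into the running sum. */
static uint64_t sum_block(uint64_t sum, const unsigned char *p)
{
	for (int i = 0; i < 8; i++) {
		uint64_t w;

		memcpy(&w, p + 8 * i, sizeof(w));
		sum = add_carry(sum, w);
	}
	return sum;
}

/* Old form: count 'len' down, as in the removed while loop. */
uint64_t csum_blocks_count(const unsigned char *buff, int len, uint64_t sum)
{
	while (len >= 64) {
		sum = sum_block(sum, buff);
		buff += 64;
		len -= 64;
	}
	return sum;
}

/* New form: compute the end pointer once and leave 'len' untouched. */
uint64_t csum_blocks_limit(const unsigned char *buff, int len, uint64_t sum)
{
	if (len >= 64) {
		const unsigned char *lim = buff + (len & ~63u);

		do {
			sum = sum_block(sum, buff);
			buff += 64;
		} while (buff < lim);
	}
	return sum;
}
```

Both functions consume the same whole 64-byte blocks. In the second form
'len' keeps its original value on exit, so code that follows can still test
its low bits.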
arch/x86/lib/csum-partial_64.c | 33 ++++++++++++++++++---------------
1 file changed, 18 insertions(+), 15 deletions(-)
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index edd3e579c2a7..342de5f24fcb 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -27,21 +27,24 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
 	u64 temp64 = (__force u64)sum;
 	unsigned result;

-	while (unlikely(len >= 64)) {
-		asm("addq 0*8(%[src]),%[res]\n\t"
-		    "adcq 1*8(%[src]),%[res]\n\t"
-		    "adcq 2*8(%[src]),%[res]\n\t"
-		    "adcq 3*8(%[src]),%[res]\n\t"
-		    "adcq 4*8(%[src]),%[res]\n\t"
-		    "adcq 5*8(%[src]),%[res]\n\t"
-		    "adcq 6*8(%[src]),%[res]\n\t"
-		    "adcq 7*8(%[src]),%[res]\n\t"
-		    "adcq $0,%[res]"
-		    : [res] "+r" (temp64)
-		    : [src] "r" (buff)
-		    : "memory");
-		buff += 64;
-		len -= 64;
+	if (unlikely(len >= 64)) {
+		const void *lim = buff + (len & ~63u);
+		do {
+			asm("addq 0*8(%[src]),%[res]\n\t"
+			    "adcq 1*8(%[src]),%[res]\n\t"
+			    "adcq 2*8(%[src]),%[res]\n\t"
+			    "adcq 3*8(%[src]),%[res]\n\t"
+			    "adcq 4*8(%[src]),%[res]\n\t"
+			    "adcq 5*8(%[src]),%[res]\n\t"
+			    "adcq 6*8(%[src]),%[res]\n\t"
+			    "adcq 7*8(%[src]),%[res]\n\t"
+			    "adcq $0,%[res]"
+			    : [res] "+r" (temp64)
+			    : [src] "r" (buff)
+			    : "memory");
+			asm("" : "+r" (buff));
+			buff += 64;
+		} while (buff < lim);
 	}

 	if (len & 32) {
--
2.17.1
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)