* [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c @ 2021-11-25 19:38 Noah Goldstein 2021-11-26 1:50 ` Eric Dumazet ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Noah Goldstein @ 2021-11-25 19:38 UTC (permalink / raw) Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, alexanderduyck, goldstein.w.n, edumazet, linux-kernel Modify the 8x loop so that it uses two independent accumulators. Despite adding more instructions, the latency and throughput of the loop are improved because the `adc` chains can now take advantage of multiple execution units. Make the memory clobbers more precise. 'buff' is read-only and we know the exact usage range. There is no reason to write-clobber all memory. Relative performance changes on Tigerlake: Time Unit: Ref Cycles Size Unit: Bytes size, lat old, lat new, tput old, tput new 0, 4.972, 5.054, 4.864, 4.870 100, 14.218, 12.476, 9.429, 9.441 200, 22.115, 16.937, 13.088, 12.852 300, 31.826, 24.640, 19.383, 18.230 400, 39.016, 28.133, 23.223, 21.304 500, 48.815, 36.186, 30.331, 27.104 600, 56.732, 40.120, 35.899, 30.363 700, 66.623, 48.178, 43.044, 36.400 800, 73.259, 51.171, 48.564, 39.173 900, 82.821, 56.635, 58.592, 45.162 1000, 90.780, 63.703, 65.658, 48.718 Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> --- arch/x86/lib/csum-partial_64.c | 36 +++++++++++++++++----------------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c index ded842cd1020..76e2f540587e 100644 --- a/arch/x86/lib/csum-partial_64.c +++ b/arch/x86/lib/csum-partial_64.c @@ -48,18 +48,21 @@ __wsum csum_partial(const void *buff, int len, __wsum sum) } while (unlikely(len >= 64)) { - asm("addq 0*8(%[src]),%[res]\n\t" - "adcq 1*8(%[src]),%[res]\n\t" - "adcq 2*8(%[src]),%[res]\n\t" - "adcq 3*8(%[src]),%[res]\n\t" - "adcq 4*8(%[src]),%[res]\n\t" + u64 temp_accum; + + asm("movq 0*8(%[src]),%[res_tmp]\n\t" + "addq 1*8(%[src]),%[res_tmp]\n\t" 
+ "adcq 2*8(%[src]),%[res_tmp]\n\t" + "adcq 3*8(%[src]),%[res_tmp]\n\t" + "adcq $0,%[res_tmp]\n\t" + "addq 4*8(%[src]),%[res]\n\t" "adcq 5*8(%[src]),%[res]\n\t" "adcq 6*8(%[src]),%[res]\n\t" "adcq 7*8(%[src]),%[res]\n\t" - "adcq $0,%[res]" - : [res] "+r" (temp64) - : [src] "r" (buff) - : "memory"); + "adcq %[res_tmp], %[res]\n\t" + "adcq $0,%[res]\n\t" + : [res] "+r"(temp64), [res_tmp] "=&r"(temp_accum) + : [src] "r"(buff), "m"(*(const char(*)[64])buff)); buff += 64; len -= 64; } @@ -70,26 +73,23 @@ __wsum csum_partial(const void *buff, int len, __wsum sum) "adcq 2*8(%[src]),%[res]\n\t" "adcq 3*8(%[src]),%[res]\n\t" "adcq $0,%[res]" - : [res] "+r" (temp64) - : [src] "r" (buff) - : "memory"); + : [res] "+r"(temp64) + : [src] "r"(buff), "m"(*(const char(*)[32])buff)); buff += 32; } if (len & 16) { asm("addq 0*8(%[src]),%[res]\n\t" "adcq 1*8(%[src]),%[res]\n\t" "adcq $0,%[res]" - : [res] "+r" (temp64) - : [src] "r" (buff) - : "memory"); + : [res] "+r"(temp64) + : [src] "r"(buff), "m"(*(const char(*)[16])buff)); buff += 16; } if (len & 8) { asm("addq 0*8(%[src]),%[res]\n\t" "adcq $0,%[res]" - : [res] "+r" (temp64) - : [src] "r" (buff) - : "memory"); + : [res] "+r"(temp64) + : [src] "r"(buff), "m"(*(const char(*)[8])buff)); buff += 8; } if (len & 7) { -- 2.25.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
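The correctness argument behind this patch (that splitting one adc chain into two preserves the result) deserves to be made explicit: the checksum is an end-around-carry sum, i.e. addition modulo 2^64 - 1, which is commutative and associative. Not part of the patch, but a quick user-space sketch (names invented) that models the adc chains with a 128-bit accumulator:

```c
#include <assert.h>
#include <stdint.h>

/* Sum n 64-bit words with end-around carry: the arithmetic an adc
 * chain terminated by "adcq $0" performs.  A 128-bit accumulator
 * collects the carries, which are then folded back into the low word. */
static uint64_t sum64(uint64_t sum, const uint64_t *w, int n)
{
	unsigned __int128 acc = sum;
	uint64_t lo, hi;
	int i;

	for (i = 0; i < n; i++)
		acc += w[i];
	lo = (uint64_t)acc;
	hi = (uint64_t)(acc >> 64);
	lo += hi;
	if (lo < hi)	/* the fold itself carried: add it back */
		lo++;
	return lo;
}

/* The patch's split: four words into a scratch accumulator (its carry
 * folded with "adcq $0"), four into the running sum, then combined. */
static uint64_t sum64_split(uint64_t sum, const uint64_t *w)
{
	uint64_t tmp = sum64(0, w, 4);

	sum = sum64(sum, w + 4, 4);
	return sum64(sum, &tmp, 1);
}
```

Because end-around addition is associative modulo 2^64 - 1, `sum64_split(s, w)` equals `sum64(s, w, 8)` for any block, which is why the two-chain version is a legal transformation of the single chain.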
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-25 19:38 [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c Noah Goldstein @ 2021-11-26 1:50 ` Eric Dumazet 2021-11-26 2:15 ` Noah Goldstein 2021-11-27 4:25 ` [PATCH v2] " Noah Goldstein 2021-11-27 6:39 ` [PATCH v3] " Noah Goldstein 2 siblings, 1 reply; 30+ messages in thread From: Eric Dumazet @ 2021-11-26 1:50 UTC (permalink / raw) To: Noah Goldstein Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, alexanderduyck, linux-kernel On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > Modify the 8x loop to that it uses two independent > accumulators. Despite adding more instructions the latency and > throughput of the loop is improved because the `adc` chains can now > take advantage of multiple execution units. Nice ! Note that I get better results if I do a different split, because the second chain gets shorter. First chain adds 5*8 bytes from the buffer, but first bytes are a mere load, so that is really 4+1 additions. Second chain adds 3*8 bytes from the buffer, plus the result coming from the first chain, also 4+1 additions. asm("movq 0*8(%[src]),%[res_tmp]\n\t" "addq 1*8(%[src]),%[res_tmp]\n\t" "adcq 2*8(%[src]),%[res_tmp]\n\t" "adcq 3*8(%[src]),%[res_tmp]\n\t" "adcq 4*8(%[src]),%[res_tmp]\n\t" "adcq $0,%[res_tmp]\n\t" "addq 5*8(%[src]),%[res]\n\t" "adcq 6*8(%[src]),%[res]\n\t" "adcq 7*8(%[src]),%[res]\n\t" "adcq %[res_tmp],%[res]\n\t" "adcq $0,%[res]" : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum) : [src] "r" (buff) : "memory"); > > Make the memory clobbers more precise. 'buff' is read only and we know > the exact usage range. There is no reason to write-clobber all memory. Not sure if that matters in this function ? Or do we expect it being inlined ? 
Personally, I find the "memory" constraint to be more readable than these casts "m"(*(const char(*)[64])buff)); > > Relative performance changes on Tigerlake: > > Time Unit: Ref Cycles > Size Unit: Bytes > > size, lat old, lat new, tput old, tput new > 0, 4.972, 5.054, 4.864, 4.870 Really what matters in modern networking is the case for 40 bytes, and eventually 8 bytes. Can you add these two cases in this nice table ? We hardly have to checksum anything with NICs that are not decades old. Apparently making the 64byte loop slightly longer incentivizes gcc to move it away (our intent with the unlikely() hint). Anyway I am thinking of providing a specialized inline version for IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the time), so we will likely not use csum_partial() anymore. Thanks ! ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 1:50 ` Eric Dumazet @ 2021-11-26 2:15 ` Noah Goldstein 2021-11-26 2:18 ` Noah Goldstein 2021-11-26 16:08 ` Eric Dumazet 0 siblings, 2 replies; 30+ messages in thread From: Noah Goldstein @ 2021-11-26 2:15 UTC (permalink / raw) To: Eric Dumazet Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <edumazet@google.com> wrote: > > On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > Modify the 8x loop to that it uses two independent > > accumulators. Despite adding more instructions the latency and > > throughput of the loop is improved because the `adc` chains can now > > take advantage of multiple execution units. > > Nice ! > > Note that I get better results if I do a different split, because the > second chain gets shorter. > > First chain adds 5*8 bytes from the buffer, but first bytes are a mere > load, so that is really 4+1 additions. > > Second chain adds 3*8 bytes from the buffer, plus the result coming > from the first chain, also 4+1 additions. Good call. With that approach I also see an improvement for the 32 byte case (which is on the hot path) so this change might actually matter :) > > asm("movq 0*8(%[src]),%[res_tmp]\n\t" > "addq 1*8(%[src]),%[res_tmp]\n\t" > "adcq 2*8(%[src]),%[res_tmp]\n\t" > "adcq 3*8(%[src]),%[res_tmp]\n\t" > "adcq 4*8(%[src]),%[res_tmp]\n\t" > "adcq $0,%[res_tmp]\n\t" > "addq 5*8(%[src]),%[res]\n\t" > "adcq 6*8(%[src]),%[res]\n\t" > "adcq 7*8(%[src]),%[res]\n\t" > "adcq %[res_tmp],%[res]\n\t" > "adcq $0,%[res]" > : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum) > : [src] "r" (buff) > : "memory"); > > > > > > Make the memory clobbers more precise. 'buff' is read only and we know > > the exact usage range. There is no reason to write-clobber all memory. > > Not sure if that matters in this function ? 
Or do we expect it being inlined ? It may matter for an LTO build. I also think it can matter for the loop case. I didn't see any difference when playing around with the function in userland with: ``` gcc -O3 -march=native -mtune=native checksum.c -o checksum ``` but IIRC if the clobber is "memory", loops with inline assembly payloads can be de-optimized if GCC can't prove the iterations don't affect each other. > > Personally, I find the "memory" constraint to be more readable than these casts > "m"(*(const char(*)[64])buff)); > Hmm, I personally find it more readable if I can tell what memory transforms happen just from reading the clobbers, but you're the maintainer. Do you want it changed in V2? > > > > Relative performance changes on Tigerlake: > > > > Time Unit: Ref Cycles > > Size Unit: Bytes > > > > size, lat old, lat new, tput old, tput new > > 0, 4.972, 5.054, 4.864, 4.870 > > Really what matters in modern networking is the case for 40 bytes, and > eventually 8 bytes. > > Can you add these two cases in this nice table ? > Sure, with your suggestion in the 32 byte cases there is an improvement there too. > We hardly have to checksum anything with NIC that are not decades old. > > Apparently making the 64byte loop slightly longer incentives gcc to > move it away (our intent with the unlikely() hint). > > Anyway I am thinking of providing a specialized inline version for > IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the > time), > so we will likely not use csum_partial() anymore. I see. For now is it worth adding a case for 40 in this implementation? if(likely(len == 40)) { // optimized 40 + buff aligned case } else { // existing code } > > Thanks ! ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 2:15 ` Noah Goldstein @ 2021-11-26 2:18 ` Noah Goldstein 2021-11-26 2:38 ` Noah Goldstein 2021-11-26 16:08 ` Eric Dumazet 1 sibling, 1 reply; 30+ messages in thread From: Noah Goldstein @ 2021-11-26 2:18 UTC (permalink / raw) To: Eric Dumazet Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Thu, Nov 25, 2021 at 8:15 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > Modify the 8x loop to that it uses two independent > > > accumulators. Despite adding more instructions the latency and > > > throughput of the loop is improved because the `adc` chains can now > > > take advantage of multiple execution units. > > > > Nice ! > > > > Note that I get better results if I do a different split, because the > > second chain gets shorter. > > > > First chain adds 5*8 bytes from the buffer, but first bytes are a mere > > load, so that is really 4+1 additions. > > > > Second chain adds 3*8 bytes from the buffer, plus the result coming > > from the first chain, also 4+1 additions. > > Good call. With that approach I also see an improvement for the 32 byte > case (which is on the hot path) so this change might actually matter :) > > > > > asm("movq 0*8(%[src]),%[res_tmp]\n\t" > > "addq 1*8(%[src]),%[res_tmp]\n\t" > > "adcq 2*8(%[src]),%[res_tmp]\n\t" > > "adcq 3*8(%[src]),%[res_tmp]\n\t" > > "adcq 4*8(%[src]),%[res_tmp]\n\t" > > "adcq $0,%[res_tmp]\n\t" > > "addq 5*8(%[src]),%[res]\n\t" > > "adcq 6*8(%[src]),%[res]\n\t" > > "adcq 7*8(%[src]),%[res]\n\t" > > "adcq %[res_tmp],%[res]\n\t" > > "adcq $0,%[res]" > > : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum) > > : [src] "r" (buff) > > : "memory"); > > > > > > > > > > Make the memory clobbers more precise. 
'buff' is read only and we know > > > the exact usage range. There is no reason to write-clobber all memory. > > > > Not sure if that matters in this function ? Or do we expect it being inlined ? > > It may matter for LTO build. I also think it can matter for the loop > case. I didn't see > any difference when playing around with the function in userland with: > > ``` > gcc -O3 -march=native -mtune=native checksum.c -o checksum > ``` > > but IIRC if the clobber is loops with inline assembly payloads can be > de-optimized if GCC can't prove the iterations don't affect each other. > > > > > > Personally, I find the "memory" constraint to be more readable than these casts > > "m"(*(const char(*)[64])buff)); > > > > Hmm, I personally find it more readable if I can tell what memory > transforms happen > just from reading the clobbers, but you're the maintainer. > > Do you want it changed in V2? > > > > > > > Relative performance changes on Tigerlake: > > > > > > Time Unit: Ref Cycles > > > Size Unit: Bytes > > > > > > size, lat old, lat new, tput old, tput new > > > 0, 4.972, 5.054, 4.864, 4.870 > > > > Really what matters in modern networking is the case for 40 bytes, and > > eventually 8 bytes. > > > > Can you add these two cases in this nice table ? > > > > Sure, with your suggestion in the 32 byte cases there is an improvement there > > too. > > > We hardly have to checksum anything with NIC that are not decades old. > > > > Apparently making the 64byte loop slightly longer incentives gcc to > > move it away (our intent with the unlikely() hint). Do you think the 40/48 byte case might be better off in GAS assembly? It's a bit difficult to get proper control flow optimization with GCC + inline assembly even with likely/unlikely (i.e. expanding the 64 byte case moves it off the hotpath, cmov + ror instead of fallthrough for hotpath). 
> > > > Anyway I am thinking of providing a specialized inline version for > > IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the > > time), > > so we will likely not use csum_partial() anymore. > > I see. For now is it worth adding a case for 40 in this implementation? > > if(likely(len == 40)) { > // optimized 40 + buff aligned case > } > else { > // existing code > } > > > > > > Thanks ! ^ permalink raw reply [flat|nested] 30+ messages in thread
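In substance, the `len == 40` special case proposed here would be a five-quadword end-around sum folded to 16 bits. A portable user-space approximation (helper names invented, not kernel code), which can be checked against the 16-bit-at-a-time definition from RFC 1071:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64-bit end-around-carry sum of a 40-byte (IPv6-header-sized) block. */
static uint64_t csum40(const unsigned char *buff)
{
	unsigned __int128 acc = 0;
	uint64_t w[5], lo, hi;
	int i;

	memcpy(w, buff, 40);
	for (i = 0; i < 5; i++)
		acc += w[i];
	lo = (uint64_t)acc;
	hi = (uint64_t)(acc >> 64);
	lo += hi;
	if (lo < hi)
		lo++;
	return lo;
}

/* Fold a 64-bit end-around sum down to the 16-bit checksum value. */
static uint16_t fold16(uint64_t s)
{
	s = (s & 0xffffffffu) + (s >> 32);
	s = (s & 0xffffffffu) + (s >> 32);
	s = (s & 0xffffu) + (s >> 16);
	s = (s & 0xffffu) + (s >> 16);
	s = (s & 0xffffu) + (s >> 16);
	return (uint16_t)s;
}

/* RFC 1071 reference: one 16-bit word at a time. */
static uint16_t csum16_ref(const unsigned char *buff, int len)
{
	uint32_t acc = 0;
	uint16_t w;
	int i;

	for (i = 0; i < len; i += 2) {
		memcpy(&w, buff + i, 2);
		acc += w;
	}
	while (acc >> 16)
		acc = (acc & 0xffff) + (acc >> 16);
	return (uint16_t)acc;
}
```

One's-complement sums can be computed in any word size and folded down (RFC 1071, section 2), so `fold16(csum40(p))` matches `csum16_ref(p, 40)` for any nonzero header; the kernel's `csum_fold` additionally complements the result.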
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 2:18 ` Noah Goldstein @ 2021-11-26 2:38 ` Noah Goldstein 2021-11-28 19:47 ` David Laight 0 siblings, 1 reply; 30+ messages in thread From: Noah Goldstein @ 2021-11-26 2:38 UTC (permalink / raw) To: Eric Dumazet Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Thu, Nov 25, 2021 at 8:18 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Thu, Nov 25, 2021 at 8:15 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > > > Modify the 8x loop to that it uses two independent > > > > accumulators. Despite adding more instructions the latency and > > > > throughput of the loop is improved because the `adc` chains can now > > > > take advantage of multiple execution units. > > > > > > Nice ! > > > > > > Note that I get better results if I do a different split, because the > > > second chain gets shorter. > > > > > > First chain adds 5*8 bytes from the buffer, but first bytes are a mere > > > load, so that is really 4+1 additions. > > > > > > Second chain adds 3*8 bytes from the buffer, plus the result coming > > > from the first chain, also 4+1 additions. > > > > Good call. 
With that approach I also see an improvement for the 32 byte > > case (which is on the hot path) so this change might actually matter :) > > > > > > > > asm("movq 0*8(%[src]),%[res_tmp]\n\t" > > > "addq 1*8(%[src]),%[res_tmp]\n\t" > > > "adcq 2*8(%[src]),%[res_tmp]\n\t" > > > "adcq 3*8(%[src]),%[res_tmp]\n\t" > > > "adcq 4*8(%[src]),%[res_tmp]\n\t" > > > "adcq $0,%[res_tmp]\n\t" > > > "addq 5*8(%[src]),%[res]\n\t" > > > "adcq 6*8(%[src]),%[res]\n\t" > > > "adcq 7*8(%[src]),%[res]\n\t" > > > "adcq %[res_tmp],%[res]\n\t" > > > "adcq $0,%[res]" > > > : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum) > > > : [src] "r" (buff) > > > : "memory"); > > > > > > > > > > > > > > Make the memory clobbers more precise. 'buff' is read only and we know > > > > the exact usage range. There is no reason to write-clobber all memory. > > > > > > Not sure if that matters in this function ? Or do we expect it being inlined ? > > > > It may matter for LTO build. I also think it can matter for the loop > > case. I didn't see > > any difference when playing around with the function in userland with: > > > > ``` > > gcc -O3 -march=native -mtune=native checksum.c -o checksum > > ``` > > > > but IIRC if the clobber is loops with inline assembly payloads can be > > de-optimized if GCC can't prove the iterations don't affect each other. > > > > > > > > > > Personally, I find the "memory" constraint to be more readable than these casts > > > "m"(*(const char(*)[64])buff)); > > > > > > > Hmm, I personally find it more readable if I can tell what memory > > transforms happen > > just from reading the clobbers, but you're the maintainer. > > > > Do you want it changed in V2? 
> > > > > > > > > > Relative performance changes on Tigerlake: > > > > > > > > Time Unit: Ref Cycles > > > > Size Unit: Bytes > > > > > > > > size, lat old, lat new, tput old, tput new > > > > 0, 4.972, 5.054, 4.864, 4.870 > > > > > > Really what matters in modern networking is the case for 40 bytes, and > > > eventually 8 bytes. > > > > > > Can you add these two cases in this nice table ? Regarding the 32 byte case, adding two accumulators helps with the latency numbers but causes a regression in throughput for the 40/48 byte cases. Which is the more important metric for the usage of csum_partial()? Here are the numbers for the smaller sizes: size, lat old, lat ver2, lat ver1, tput old, tput ver2, tput ver1 0, 4.961, 4.503, 4.901, 4.887, 4.399, 4.951 8, 5.590, 5.594, 5.620, 4.227, 4.110, 4.252 16, 6.182, 6.398, 6.202, 4.233, 4.062, 4.278 24, 7.392, 7.591, 7.380, 4.256, 4.246, 4.279 32, 7.371, 6.366, 7.390, 4.550, 4.900, 4.537 40, 8.621, 7.496, 8.601, 4.862, 5.162, 4.836 48, 9.406, 8.128, 9.374, 5.206, 5.736, 5.234 56, 10.535, 9.189, 10.522, 5.416, 5.772, 5.447 64, 10.000, 7.487, 7.590, 6.946, 6.975, 6.989 72, 11.192, 8.639, 8.763, 7.210, 7.311, 7.277 80, 11.734, 9.179, 9.409, 7.605, 7.620, 7.548 88, 12.933, 10.545, 10.584, 7.878, 7.902, 7.858 96, 12.952, 9.331, 10.625, 8.168, 8.470, 8.206 104, 14.206, 10.424, 11.839, 8.491, 8.785, 8.502 112, 14.763, 11.403, 12.416, 8.798, 9.134, 8.771 120, 15.955, 12.635, 13.651, 9.175, 9.494, 9.130 128, 15.271, 10.599, 10.724, 9.726, 9.672, 9.655 'ver2' uses two accumulators for 32 byte case and has better latency numbers but regressions in tput compared to 'old' and 'ver1'. 'ver1' is the implementation posted which has essentially the same numbers for tput/lat as 'old' for sizes [0, 63]. > > > > > > > Sure, with your suggestion in the 32 byte cases there is an improvement there > > too. > > > > > We hardly have to checksum anything with NIC that are not decades old. 
> > > > > > Apparently making the 64byte loop slightly longer incentives gcc to > > > move it away (our intent with the unlikely() hint). > > Do you think the 40/48 byte case might be better of in GAS assembly. It's > a bit difficult to get proper control flow optimization with GCC + > inline assembly > even with likely/unlikely (i.e expanding the 64 byte case moves it off hotpath, > cmov + ror instead of fallthrough for hotpath). > > > > > > Anyway I am thinking of providing a specialized inline version for > > > IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the > > > time), > > > so we will likely not use csum_partial() anymore. > > > > I see. For now is it worth adding a case for 40 in this implementation? > > > > if(likely(len == 40)) { > > // optimized 40 + buff aligned case > > } > > else { > > // existing code > > } > > > > > > > > > > Thanks ! ^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 2:38 ` Noah Goldstein @ 2021-11-28 19:47 ` David Laight 2021-11-28 20:59 ` Noah Goldstein 0 siblings, 1 reply; 30+ messages in thread From: David Laight @ 2021-11-28 19:47 UTC (permalink / raw) To: 'Noah Goldstein', Eric Dumazet Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list ... > Regarding the 32 byte case, adding two accumulators helps with the latency > numbers but causes a regression in throughput for the 40/48 byte cases. Which > is the more important metric for the usage of csum_partial()? > > Here are the numbers for the smaller sizes: > > size, lat old, lat ver2, lat ver1, tput old, tput ver2, tput ver1 > 0, 4.961, 4.503, 4.901, 4.887, 4.399, 4.951 > 8, 5.590, 5.594, 5.620, 4.227, 4.110, 4.252 > 16, 6.182, 6.398, 6.202, 4.233, 4.062, 4.278 > 24, 7.392, 7.591, 7.380, 4.256, 4.246, 4.279 > 32, 7.371, 6.366, 7.390, 4.550, 4.900, 4.537 > 40, 8.621, 7.496, 8.601, 4.862, 5.162, 4.836 > 48, 9.406, 8.128, 9.374, 5.206, 5.736, 5.234 > 56, 10.535, 9.189, 10.522, 5.416, 5.772, 5.447 > 64, 10.000, 7.487, 7.590, 6.946, 6.975, 6.989 > 72, 11.192, 8.639, 8.763, 7.210, 7.311, 7.277 > 80, 11.734, 9.179, 9.409, 7.605, 7.620, 7.548 > 88, 12.933, 10.545, 10.584, 7.878, 7.902, 7.858 > 96, 12.952, 9.331, 10.625, 8.168, 8.470, 8.206 > 104, 14.206, 10.424, 11.839, 8.491, 8.785, 8.502 > 112, 14.763, 11.403, 12.416, 8.798, 9.134, 8.771 > 120, 15.955, 12.635, 13.651, 9.175, 9.494, 9.130 > 128, 15.271, 10.599, 10.724, 9.726, 9.672, 9.655 > > 'ver2' uses two accumulators for 32 byte case and has better latency numbers > but regressions in tput compared to 'old' and 'ver1'. 'ver1' is the > implementation > posted which has essentially the same numbers for tput/lat as 'old' > for sizes [0, 63]. Which cpu are you testing on - it will make a big difference ? And what are you measuring throughput in? And are you testing aligned or mis-aligned 64bit reads? 
I think one of the performance counters will give 'cpu clocks'. I did some tests early last year and got 8 bytes/clock on broadwell/haswell with code that 'loop carried' the carry flag (a single adc chain). On the older Intel cpu (Ivy bridge onwards) 'adc' has a latency of 2 for the result, but the carry flag is available earlier. So alternating the target register in the 'adc' chain will give (nearly) 8 bytes/clock - I think I got to 7.5. That can all be done with only 4 reads per iteration. IIRC broadwell/haswell only need 2 reads/iteration. It is actually likely (certainly worth checking) that haswell/broadwell can do two misaligned memory reads every clock. So it may not be worth aligning the reads (which the old code did). In any case aren't tx packets likely to be aligned, and rx ones misaligned to some known 4n+2 boundary? Using adxc/adxo together is a right PITA. I did get (about) 12 bytes/clock for long buffers while loop carrying both the overflow and carry flags. Also is there a copy of the patched code anywhere? I think I've missed some of the patches and they are difficult to follow. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-28 19:47 ` David Laight @ 2021-11-28 20:59 ` Noah Goldstein 2021-11-28 22:41 ` David Laight 2021-12-02 14:24 ` David Laight 0 siblings, 2 replies; 30+ messages in thread From: Noah Goldstein @ 2021-11-28 20:59 UTC (permalink / raw) To: David Laight Cc: Eric Dumazet, tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Sun, Nov 28, 2021 at 1:47 PM David Laight <David.Laight@aculab.com> wrote: > > ... > > Regarding the 32 byte case, adding two accumulators helps with the latency > > numbers but causes a regression in throughput for the 40/48 byte cases. Which > > is the more important metric for the usage of csum_partial()? > > > > Here are the numbers for the smaller sizes: > > > > size, lat old, lat ver2, lat ver1, tput old, tput ver2, tput ver1 > > 0, 4.961, 4.503, 4.901, 4.887, 4.399, 4.951 > > 8, 5.590, 5.594, 5.620, 4.227, 4.110, 4.252 > > 16, 6.182, 6.398, 6.202, 4.233, 4.062, 4.278 > > 24, 7.392, 7.591, 7.380, 4.256, 4.246, 4.279 > > 32, 7.371, 6.366, 7.390, 4.550, 4.900, 4.537 > > 40, 8.621, 7.496, 8.601, 4.862, 5.162, 4.836 > > 48, 9.406, 8.128, 9.374, 5.206, 5.736, 5.234 > > 56, 10.535, 9.189, 10.522, 5.416, 5.772, 5.447 > > 64, 10.000, 7.487, 7.590, 6.946, 6.975, 6.989 > > 72, 11.192, 8.639, 8.763, 7.210, 7.311, 7.277 > > 80, 11.734, 9.179, 9.409, 7.605, 7.620, 7.548 > > 88, 12.933, 10.545, 10.584, 7.878, 7.902, 7.858 > > 96, 12.952, 9.331, 10.625, 8.168, 8.470, 8.206 > > 104, 14.206, 10.424, 11.839, 8.491, 8.785, 8.502 > > 112, 14.763, 11.403, 12.416, 8.798, 9.134, 8.771 > > 120, 15.955, 12.635, 13.651, 9.175, 9.494, 9.130 > > 128, 15.271, 10.599, 10.724, 9.726, 9.672, 9.655 > > > > 'ver2' uses two accumulators for 32 byte case and has better latency numbers > > but regressions in tput compared to 'old' and 'ver1'. 
'ver1' is the > > implementation > > posted which has essentially the same numbers for tput/lat as 'old' > > for sizes [0, 63]. > > Which cpu are you testing on - it will make a big difference ? Tigerlake, although assuming `adc` as the bottleneck, the results should be largely independent. > And what are you measing throughput in? Running back to back iterations with the same input without any dependency between iterations. The OoO window will include multiple iterations at once. > And are you testing aligned or mis-aligned 64bit reads? Aligned as that is the common case. > > I think one of the performance counters will give 'cpu clocks'. Time is in Ref Cycles using `rdtsc` > > I did some tests early last year and got 8 bytes/clock on broadwell/haswell > with code that 'loop carried' the carry flag (a single adc chain). > On the older Intel cpu (Ivy bridge onwards) 'adc' has a latency of 2 > for the result, but the carry flag is available earlier. > So alternating the target register in the 'adc' chain will give (nearly) > 8 bytes/clock - I think I got to 7.5. > > That can all be done with only 4 reads per interaction. > IIRC broadwell/haswell only need 2 reads/iteration. > > It is actually likely (certainly worth checking) that haswell/broadwell > can do two misaligned memory reads every clock. > So it may not be worth aligning the reads (which the old code did). > In any case aren't tx packets likely to be aligned, and rx ones > misaligned to some known 4n+2 boundary? > > Using adxc/adxo together is a right PITA. I'm a bit hesitant about adxc/adxo because they are extensions so support will need to be tested. > I did get (about) 12 bytes/clock fo long buffers while loop carrying > both the overflow and carry flags. > > Also is there a copy of the patched code anywhere? https://lore.kernel.org/lkml/CANn89iLpFOok_tv=DKsLX1mxZGdHQgATdW4Xs0rc6oaXQEa5Ng@mail.gmail.com/T/ > I think I've missed some of the patches and they are difficult to follow. 
> > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-28 20:59 ` Noah Goldstein @ 2021-11-28 22:41 ` David Laight 2021-12-02 14:24 ` David Laight 1 sibling, 0 replies; 30+ messages in thread From: David Laight @ 2021-11-28 22:41 UTC (permalink / raw) To: 'Noah Goldstein' Cc: Eric Dumazet, tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list From: Noah Goldstein > Sent: 28 November 2021 21:00 > > On Sun, Nov 28, 2021 at 1:47 PM David Laight <David.Laight@aculab.com> wrote: > > > > ... > > > Regarding the 32 byte case, adding two accumulators helps with the latency > > > numbers but causes a regression in throughput for the 40/48 byte cases. Which > > > is the more important metric for the usage of csum_partial()? > > > > > > Here are the numbers for the smaller sizes: > > > > > > size, lat old, lat ver2, lat ver1, tput old, tput ver2, tput ver1 > > > 0, 4.961, 4.503, 4.901, 4.887, 4.399, 4.951 > > > 8, 5.590, 5.594, 5.620, 4.227, 4.110, 4.252 > > > 16, 6.182, 6.398, 6.202, 4.233, 4.062, 4.278 > > > 24, 7.392, 7.591, 7.380, 4.256, 4.246, 4.279 > > > 32, 7.371, 6.366, 7.390, 4.550, 4.900, 4.537 > > > 40, 8.621, 7.496, 8.601, 4.862, 5.162, 4.836 > > > 48, 9.406, 8.128, 9.374, 5.206, 5.736, 5.234 > > > 56, 10.535, 9.189, 10.522, 5.416, 5.772, 5.447 > > > 64, 10.000, 7.487, 7.590, 6.946, 6.975, 6.989 > > > 72, 11.192, 8.639, 8.763, 7.210, 7.311, 7.277 > > > 80, 11.734, 9.179, 9.409, 7.605, 7.620, 7.548 > > > 88, 12.933, 10.545, 10.584, 7.878, 7.902, 7.858 > > > 96, 12.952, 9.331, 10.625, 8.168, 8.470, 8.206 > > > 104, 14.206, 10.424, 11.839, 8.491, 8.785, 8.502 > > > 112, 14.763, 11.403, 12.416, 8.798, 9.134, 8.771 > > > 120, 15.955, 12.635, 13.651, 9.175, 9.494, 9.130 > > > 128, 15.271, 10.599, 10.724, 9.726, 9.672, 9.655 > > > > > > 'ver2' uses two accumulators for 32 byte case and has better latency numbers > > > but regressions in tput compared to 'old' and 'ver1'. 
'ver1' is the > > > implementation > > > posted which has essentially the same numbers for tput/lat as 'old' > > > for sizes [0, 63]. > > > > Which cpu are you testing on - it will make a big difference ? > > Tigerlake, although assuming `adc` as the bottleneck, the results > should be largely independent. The cpu definitely makes a difference. Although the big changes for Intel mainstream cpu were before Ivy/Sandy bridge and to Broadwell/Haswell. They improved the latency for adc from 2 clocks to 1 clock. The later cpus also have extra instruction ports - which can help parallel execution. That could well make a big difference where you've duplicated the adc chain. Interestingly, an adc chain can only do one add (8 bytes) per clock. But you should be able to do two 4 byte loads and add to two separate 64bit registers every clock. That is also 8 bytes/clock. So a C loop: for (;buf != buf_end; buf += 2) { sum_64a += buf_32[0]; sum_64b += buf_32[1]; } Might be as fast as the asm one! It probably needs unrolling once, and may need 'adjusting' so that the loop test is 'add $d, %reg; jnz 1b' The 'add' and 'jnz' should then get 'fused' into a single u-op. > > And what are you measing throughput in? > > Running back to back iterations with the same input without any > dependency between iterations. The OoO window will include > multiple iterations at once. > > > And are you testing aligned or mis-aligned 64bit reads? > > Aligned as that is the common case. I was thinking that the code to align the buffer start may not be needed - so the test can be removed without affecting performance. Especially if aligned buffers are 'expected'. So you just have to support variable lengths, not alignments.
The cpu hardware will change the clock multiplier on you. I've used histogram[delta_cycles() >> n]++ (with bound checking) to get a distribution of the cost of each pass - rather than an average. I last used that testing writev("/dev/null", ...). Each run gave a sharp peak - but it differed run to run! Possibly because of the relative alignment of the stacks and buffers. .. > > Using adxc/adxo together is a right PITA. > > I'm a bit hesitant about adxc/adxo because they are extensions so > support will need to be tested. Indeed. > https://lore.kernel.org/lkml/CANn89iLpFOok_tv=DKsLX1mxZGdHQgATdW4Xs0rc6oaXQEa5Ng@mail.gmail.com/T/ Ah that is your patch that just changes the asm() block. I was looking for Eric's changes as well. I spent too long trying to optimise this code last year. Mostly because I found a really crap version in one of our applications that is checksumming UDP/IPv4 RTP packets before sending them on a RAW socket. These are about 200 bytes each and we are sending 100s every 20ms. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 30+ messages in thread
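David's "two 32-bit loads into two 64-bit accumulators" loop from his earlier message is easy to make concrete. A portable sketch (function name invented) that needs no adc at all, because a 64-bit accumulator can absorb 2^32 carry-free 32-bit additions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* len must be a multiple of 8 and small enough that sum_64a + sum_64b
 * cannot overflow (each term is < (len/8) * 2^32, so len < ~2^34). */
static uint64_t csum32x2(const unsigned char *buf, size_t len)
{
	const unsigned char *end = buf + len;
	uint64_t sum_64a = 0, sum_64b = 0, s;
	uint32_t w0, w1;

	for (; buf != end; buf += 8) {
		memcpy(&w0, buf, 4);	/* two independent 32-bit loads */
		memcpy(&w1, buf + 4, 4);
		sum_64a += w0;		/* plain adds; no carry possible */
		sum_64b += w1;
	}
	/* combine, then fold down to a 32-bit end-around sum */
	s = sum_64a + sum_64b;
	while (s >> 32)
		s = (s & 0xffffffffu) + (s >> 32);
	return s;
}
```

This illustrates the trade-off David describes: plain adds on two registers can also reach 8 bytes/clock without touching the carry flag, at the cost of a final reduction step.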
* RE: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
  2021-11-28 20:59 ` Noah Goldstein
  2021-11-28 22:41 ` David Laight
@ 2021-12-02 14:24 ` David Laight
  2021-12-02 15:01 ` Eric Dumazet
  1 sibling, 1 reply; 30+ messages in thread
From: David Laight @ 2021-12-02 14:24 UTC (permalink / raw)
To: 'Noah Goldstein'
Cc: Eric Dumazet, tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list, netdev

I've dug out my test program and measured the performance of
various copies of the inner loop - usually 64 bytes/iteration.
Code is below.

It uses the hardware performance counter to get the number of
clocks the inner loop takes.
This is reasonably stable once the branch predictor has settled down.
So the difference in clocks between a 64 byte buffer and a 128 byte
buffer is the number of clocks for 64 bytes.
(Unlike the TSC the pmc count doesn't depend on the cpu frequency.)

What is interesting is that even some of the trivial loops appear
to be doing 16 bytes per clock for short buffers - which is impossible.
Checksum 1k bytes and you get an entirely different answer.
The only loop that really exceeds 8 bytes/clock for long buffers
is the adcx/adox one.

What is almost certainly happening is that all the memory reads and
the dependent add/adc instructions are all queued up in the 'out of
order' execution unit.
Since 'rdpmc' isn't a serialising instruction they can still be
outstanding when the function returns.
Uncomment the 'rdtsc' and you get much slower values for short buffers.

When testing the full checksum function the queued up memory
reads and adc are probably running in parallel with the logic
that is handling lengths that aren't multiples of 64.

I also found nothing consistently different for misaligned reads.

These were all tested on my i7-7700 cpu.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>

static int init_pmc(void)
{
	static struct perf_event_attr perf_attr = {
		.type = PERF_TYPE_HARDWARE,
		.config = PERF_COUNT_HW_CPU_CYCLES,
		.pinned = 1,
	};
	struct perf_event_mmap_page *pc;

	int perf_fd;

	perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
	if (perf_fd < 0) {
		fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
		exit(1);
	}
	pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
	if (pc == MAP_FAILED) {
		fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
		exit(1);
	}
	return pc->index - 1;
}

static inline unsigned int rdpmc(unsigned int counter)
{
	unsigned int low, high;

	// asm volatile("rdtsc" : "=a" (low), "=d" (high));
	asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));

	// return low bits, counter might be 32 or 40 bits wide.
	return low;
}

#ifndef MODE
#define MODE 0
#endif

__attribute__((noinline))
unsigned long csum64(unsigned long csum, const unsigned char *buff, unsigned long len)
{
#if MODE == 0
// Simple 'adc' loop, abs max 8 bytes/clock.
// Will be very slow on old (pre Ivy Bridge) cpus where 'dec'
// has a false dependency against the carry flag.
// Nearly as good if doing 32 bytes/iteration.
#define DFLT_OFFSET 40
	long count = len/32;
	asm("	clc\n"
	    "1:	adcq (%[buff]), %[csum]\n"
	    "	adcq 8(%[buff]), %[csum]\n"
	    "	adcq 16(%[buff]), %[csum]\n"
	    "	adcq 24(%[buff]), %[csum]\n"
	    "	lea 32(%[buff]), %[buff]\n"
	    "	dec %[count]\n"
	    "	jnz 1b\n"
	    "	adcq $0, %[csum]\n"
		: [csum] "+&r" (csum), [len] "+&r" (len), [count] "+&r" (count)
		: [buff] "r" (buff)
		: "memory");
#endif

#if MODE == 1
// Alternate 'adc' loop that avoids using the carry flag.
#define DFLT_OFFSET 40 buff += len; len = -len; asm(" clc\n" "1: jrcxz 2f\n" " adcq 0(%[buff], %[len]), %[csum]\n" " adcq 8(%[buff], %[len]), %[csum]\n" " adcq 16(%[buff], %[len]), %[csum]\n" " adcq 24(%[buff], %[len]), %[csum]\n" " lea 32(%[len]), %[len]\n" " jmp 1b\n" "2: adcq $0, %[csum]\n" : [csum] "+&r" (csum), [len] "+&c" (len) : [buff] "r" (buff) : "memory"); #endif #if MODE == 2 // Existing kernel loop. // The adc chain limits it to about 8 bytes/clock. #define DFLT_OFFSET 40 long count = len/64; asm("1: addq 0(%[buff]), %[csum]\n" " adcq 8(%[buff]), %[csum]\n" " adcq 16(%[buff]), %[csum]\n" " adcq 24(%[buff]), %[csum]\n" " adcq 32(%[buff]), %[csum]\n" " adcq 40(%[buff]), %[csum]\n" " adcq 48(%[buff]), %[csum]\n" " adcq 56(%[buff]), %[csum]\n" " adcq $0, %[csum]\n" " lea 64(%[buff]), %[buff]\n" " dec %[count]\n" " jnz 1b" : [csum] "+&r" (csum), [len] "+&r" (len), [count] "+&r" (count) : [buff] "r" (buff) : "memory"); #endif #if MODE == 3 // Proposed kernel loop. // This splits the adc chain in two. // Faster than (2) for short buffers - not sure why. #define DFLT_OFFSET 38 long count = len/64; unsigned long csum1; asm("1: movq 0(%[buff]), %[csum1]\n" " adcq 8(%[buff]), %[csum1]\n" " adcq 16(%[buff]), %[csum1]\n" " adcq 24(%[buff]), %[csum1]\n" " adcq 32(%[buff]), %[csum1]\n" " adcq $0, %[csum1]\n" " lea 64(%[buff]), %[buff]\n" " addq 40-64(%[buff]), %[csum]\n" " adcq 48-64(%[buff]), %[csum]\n" " adcq 56-64(%[buff]), %[csum]\n" " adcq %[csum1], %[csum]\n" " adcq $0, %[csum]\n" " dec %[count]\n" " jnz 1b" : [csum] "+&r" (csum), [csum1] "=&r" (csum1), [len] "+&r" (len), [count] "+&r" (count) : [buff] "r" (buff) : "memory"); #endif #if MODE == 4 // Less unrolled version of (3). // Maybe just as fast! 
#define DFLT_OFFSET 43 long count = len; unsigned long csum1; asm("1: movq 0(%[buff]), %[csum1]\n" " adcq 8(%[buff]), %[csum1]\n" " adcq $0, %[csum1]\n" " lea 32(%[buff]), %[buff]\n" " addq 16-32(%[buff]), %[csum]\n" " adcq 24-32(%[buff]), %[csum]\n" " adcq %[csum1], %[csum]\n" " adcq $0, %[csum]\n" " sub $32, %[count]\n" " jnz 1b" : [csum] "+&r" (csum), [csum1] "=&r" (csum1), [len] "+&r" (len), [count] "+&r" (count) : [buff] "r" (buff) : "memory"); #endif #if MODE == 5 // adxc/adxo loop. // This is the only loop that exceeds 8 bytes/clock for 1k buffers. #define DFLT_OFFSET 40 unsigned long csum_odd; buff += len; len = -len; asm( " xor %[csum_odd], %[csum_odd]\n" // Also clears carry and overflow "10: jrcxz 20f\n" " adcx (%[buff], %[len]), %[csum]\n" " adox 8(%[buff], %[len]), %[csum_odd]\n" " adcx 16(%[buff], %[len]), %[csum]\n" " adox 24(%[buff], %[len]), %[csum_odd]\n" " adcx 32(%[buff], %[len]), %[csum]\n" " adox 40(%[buff], %[len]), %[csum_odd]\n" " adcx 48(%[buff], %[len]), %[csum]\n" " adox 56(%[buff], %[len]), %[csum_odd]\n" " lea 64(%[len]), %[len]\n" " jmp 10b\n" "20: adox %[len], %[csum_odd]\n" // [len] is zero " adcx %[csum_odd], %[csum]\n" " adcx %[len], %[csum]" : [csum] "+&r" (csum), [len] "+&c" (len), [csum_odd] "=&r" (csum_odd) : [buff] "r" (buff) : "memory"); #endif return csum; } unsigned char buff[8192] __attribute__((aligned(4096))) = { 0x46,0x56,0x20,0x04,0x10,0x02,0x50,0x07,0x72,0x4d,0xc6,0x3d,0x31,0x85,0x2d,0xbd, 0xe2,0xe0,0x9d,0x3e,0x3b,0x7a,0x70,0x3d,0xd2,0xfb,0x8c,0xbf,0x95,0x10,0xa9,0xbe, 0xeb,0xfd,0x29,0x40,0xd5,0x7a,0x61,0x40,0xde,0xcd,0x14,0xbf,0x81,0x1b,0xf6,0x3f, 0xbc,0xff,0x17,0x3f,0x67,0x1c,0x6e,0xbe,0xf4,0xc2,0x05,0x40,0x0b,0x13,0x78,0x3f, 0xfe,0x47,0xa7,0xbd,0x59,0xc2,0x15,0x3f,0x07,0xd0,0xea,0xbf,0x97,0xf1,0x3c,0x3f, 0xcc,0xfa,0x6b,0x40,0x72,0x6a,0x4f,0xbe,0x0b,0xe3,0x75,0x3e,0x3c,0x9b,0x0e,0xbf, 0xa9,0xeb,0xb7,0x3f,0xeb,0x4a,0xec,0x3e,0x33,0x8c,0x0c,0x3f,0x6a,0xf2,0xf3,0x3e, 
0x2b,0x45,0x86,0x3f,0x83,0xce,0x8a,0x3f,0xf6,0x01,0x16,0x40,0x9c,0x17,0x47,0x3e, 0x44,0x83,0x61,0x40,0x74,0xc7,0x5c,0x3f,0xec,0xe7,0x95,0x3f,0xee,0x19,0xb5,0xbf, 0xb5,0xf0,0x03,0xbf,0xd1,0x02,0x1c,0x3e,0xa3,0x55,0x90,0xbe,0x1e,0x0b,0xa1,0xbf, 0xa4,0xa8,0xb4,0x3f,0xc6,0x68,0x91,0x3f,0xd1,0xc5,0xab,0x3f,0xb9,0x14,0x62,0x3f, 0x7c,0xe0,0xb9,0xbf,0xc0,0xa4,0xb5,0x3d,0x6f,0xd9,0xa7,0x3f,0x8f,0xc4,0xb0,0x3d, 0x48,0x2c,0x7a,0x3e,0x83,0xb2,0x3c,0x40,0x36,0xd3,0x18,0x40,0xb7,0xa9,0x57,0x40, 0xda,0xd3,0x95,0x3f,0x74,0x95,0xc0,0xbe,0xbb,0xce,0x71,0x3e,0x95,0xec,0x18,0xbf, 0x94,0x17,0xdd,0x3f,0x98,0xa5,0x02,0x3f,0xbb,0xfb,0xbb,0x3e,0xd0,0x5a,0x9c,0x3f, 0xd4,0x70,0x9b,0xbf,0x3b,0x9f,0x20,0xc0,0x84,0x5b,0x0f,0x40,0x5e,0x48,0x2c,0xbf, }; #ifndef PASSES #define PASSES 10 #endif #ifndef OFFSET #ifdef DFLT_OFFSET #define OFFSET DFLT_OFFSET #else #define OFFSET 0 #endif #endif int main(int argc, char **argv) { unsigned long csum; unsigned int tick; unsigned int ticks[PASSES]; unsigned int len, off; unsigned int i; unsigned int id = init_pmc(); len = sizeof buff; off = 0; if (argv[1]) { len = atoi(argv[1]); len = (len + 63) & ~63; if (argv[2]) off = atoi(argv[2]) & 7; } for (i = 0; i < PASSES; i++) { tick = rdpmc(id); csum = csum64(0, buff + off, len); ticks[i] = rdpmc(id) - tick; } printf(" ticks rate %lx\n", csum); for (i = 0; i < PASSES; i++) printf(" %7u %7u\n", ticks[i], 100 * len / (ticks[i] - OFFSET)); return 0; } - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-12-02 14:24 ` David Laight @ 2021-12-02 15:01 ` Eric Dumazet 2021-12-02 20:19 ` Noah Goldstein 0 siblings, 1 reply; 30+ messages in thread From: Eric Dumazet @ 2021-12-02 15:01 UTC (permalink / raw) To: David Laight Cc: Noah Goldstein, tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list, netdev On Thu, Dec 2, 2021 at 6:24 AM David Laight <David.Laight@aculab.com> wrote: > > I've dug out my test program and measured the performance of > various copied of the inner loop - usually 64 bytes/iteration. > Code is below. > > It uses the hardware performance counter to get the number of > clocks the inner loop takes. > This is reasonable stable once the branch predictor has settled down. > So the different in clocks between a 64 byte buffer and a 128 byte > buffer is the number of clocks for 64 bytes. > (Unlike the TSC the pmc count doesn't depend on the cpu frequency.) > > What is interesting is that even some of the trivial loops appear > to be doing 16 bytes per clock for short buffers - which is impossible. > Checksum 1k bytes and you get an entirely different answer. > The only loop that really exceeds 8 bytes/clock for long buffers > is the adxc/adoc one. > > What is almost certainly happening is that all the memory reads and > the dependant add/adc instructions are all queued up in the 'out of > order' execution unit. > Since 'rdpmc' isn't a serialising instruction they can still be > outstanding when the function returns. > Uncomment the 'rdtsc' and you get much slower values for short buffers. > > When testing the full checksum function the queued up memory > reads and adc are probably running in parallel with the logic > that is handling lengths that aren't multiples of 64. > > I also found nothing consistently different for misaligned reads. > > These were all tested on my i7-7700 cpu. > I usually do not bother timing each call. 
I instead time a loop of 1,000,000,000 calls. Yes, this includes loop cost, but this is the same cost for all variants. for (i = 0; i < 100*1000*1000; i++) { res += csum_partial((void *)frame + 14 + 64*0, 40, 0); res += csum_partial((void *)frame + 14 + 64*1, 40, 0); res += csum_partial((void *)frame + 14 + 64*2, 40, 0); res += csum_partial((void *)frame + 14 + 64*3, 40, 0); res += csum_partial((void *)frame + 14 + 64*4, 40, 0); res += csum_partial((void *)frame + 14 + 64*5, 40, 0); res += csum_partial((void *)frame + 14 + 64*6, 40, 0); res += csum_partial((void *)frame + 14 + 64*7, 40, 0); res += csum_partial((void *)frame + 14 + 64*8, 40, 0); res += csum_partial((void *)frame + 14 + 64*9, 40, 0); } Then use " perf stat ./bench" or similar. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
  2021-12-02 15:01 ` Eric Dumazet
@ 2021-12-02 20:19 ` Noah Goldstein
  2021-12-02 21:11 ` David Laight
  0 siblings, 1 reply; 30+ messages in thread
From: Noah Goldstein @ 2021-12-02 20:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Laight, tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list, netdev

On Thu, Dec 2, 2021 at 9:01 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Dec 2, 2021 at 6:24 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > I've dug out my test program and measured the performance of
> > various copies of the inner loop - usually 64 bytes/iteration.
> > Code is below.
> >
> > It uses the hardware performance counter to get the number of
> > clocks the inner loop takes.
> > This is reasonably stable once the branch predictor has settled down.
> > So the difference in clocks between a 64 byte buffer and a 128 byte
> > buffer is the number of clocks for 64 bytes.

Intuitively 10 passes is a bit low. Also you might consider aligning
the `csum64` function and possibly the loops.

Is there a reason you put `jrcxz` at the beginning of the loops instead
of the end?

> > (Unlike the TSC the pmc count doesn't depend on the cpu frequency.)
> >
> > What is interesting is that even some of the trivial loops appear
> > to be doing 16 bytes per clock for short buffers - which is impossible.
> > Checksum 1k bytes and you get an entirely different answer.
> > The only loop that really exceeds 8 bytes/clock for long buffers
> > is the adcx/adox one.
> >
> > What is almost certainly happening is that all the memory reads and
> > the dependent add/adc instructions are all queued up in the 'out of
> > order' execution unit.
> > Since 'rdpmc' isn't a serialising instruction they can still be
> > outstanding when the function returns.
> > Uncomment the 'rdtsc' and you get much slower values for short buffers.
Maybe add an `lfence` before / after `csum64`

> >
> > When testing the full checksum function the queued up memory
> > reads and adc are probably running in parallel with the logic
> > that is handling lengths that aren't multiples of 64.
> >
> > I also found nothing consistently different for misaligned reads.
> >
> > These were all tested on my i7-7700 cpu.
> >
> I usually do not bother timing each call.
> I instead time a loop of 1,000,000,000 calls.
> Yes, this includes loop cost, but this is the same cost for all variants.
> for (i = 0; i < 100*1000*1000; i++) {
>         res += csum_partial((void *)frame + 14 + 64*0, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*1, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*2, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*3, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*4, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*5, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*6, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*7, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*8, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*9, 40, 0);
> }

+1. You can also feed `res` from the previous iteration into the next
iteration to measure latency cheaply, if that is a better predictor of
performance.

>
> Then use " perf stat ./bench" or similar.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* RE: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-12-02 20:19 ` Noah Goldstein @ 2021-12-02 21:11 ` David Laight 0 siblings, 0 replies; 30+ messages in thread From: David Laight @ 2021-12-02 21:11 UTC (permalink / raw) To: 'Noah Goldstein', Eric Dumazet Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list, netdev From: Noah Goldstein > Sent: 02 December 2021 20:19 > > On Thu, Dec 2, 2021 at 9:01 AM Eric Dumazet <edumazet@google.com> wrote: > > > > On Thu, Dec 2, 2021 at 6:24 AM David Laight <David.Laight@aculab.com> wrote: > > > > > > I've dug out my test program and measured the performance of > > > various copied of the inner loop - usually 64 bytes/iteration. > > > Code is below. > > > > > > It uses the hardware performance counter to get the number of > > > clocks the inner loop takes. > > > This is reasonable stable once the branch predictor has settled down. > > > So the different in clocks between a 64 byte buffer and a 128 byte > > > buffer is the number of clocks for 64 bytes. > > Intuitively 10 passes is a bit low. I'm doing 10 separate measurements. The first one is much slower because the cache is cold. All the ones after (typically) number 5 or 6 tend to give the same answer. 10 is plenty to give you that 'warm fuzzy feeling' that you've got a consistent answer. Run the program 5 or 6 times with the same parameters and you sometimes get a different stable value - probably something to do with stack and data physical pages. Was more obvious when I was timing a system call. > Also you might consider aligning > the `csum64` function and possibly the loops. Won't matter here, instruction decode isn't the problem. Also the uops all come out of the loop uop cache. > There a reason you put ` jrcxz` at the beginning of the loops instead > of the end? jrcxz is 'jump if cx zero' - hard to use at the bottom of a loop! 
The 'paired' loop end instruction is 'loop' - decrement %cx and jump non-zero. But that is 7+ cycles on current Intel cpu (ok on amd ones). I can get a two clock loop with jrcxz and jmp - as in the examples. But it is more stable taken out to 4 clocks. You can't do a one clock loop :-( > > > (Unlike the TSC the pmc count doesn't depend on the cpu frequency.) > > > > > > What is interesting is that even some of the trivial loops appear > > > to be doing 16 bytes per clock for short buffers - which is impossible. > > > Checksum 1k bytes and you get an entirely different answer. > > > The only loop that really exceeds 8 bytes/clock for long buffers > > > is the adxc/adoc one. > > > > > > What is almost certainly happening is that all the memory reads and > > > the dependant add/adc instructions are all queued up in the 'out of > > > order' execution unit. > > > Since 'rdpmc' isn't a serialising instruction they can still be > > > outstanding when the function returns. > > > Uncomment the 'rdtsc' and you get much slower values for short buffers. > > Maybe add an `lfence` before / after `csum64` That's probably less strong than rdtsc, I might try it. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 2:15 ` Noah Goldstein 2021-11-26 2:18 ` Noah Goldstein @ 2021-11-26 16:08 ` Eric Dumazet 2021-11-26 18:17 ` Noah Goldstein 2021-11-26 18:17 ` Eric Dumazet 1 sibling, 2 replies; 30+ messages in thread From: Eric Dumazet @ 2021-11-26 16:08 UTC (permalink / raw) To: Noah Goldstein Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Thu, Nov 25, 2021 at 6:15 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > Modify the 8x loop to that it uses two independent > > > accumulators. Despite adding more instructions the latency and > > > throughput of the loop is improved because the `adc` chains can now > > > take advantage of multiple execution units. > > > > Nice ! > > > > Note that I get better results if I do a different split, because the > > second chain gets shorter. > > > > First chain adds 5*8 bytes from the buffer, but first bytes are a mere > > load, so that is really 4+1 additions. > > > > Second chain adds 3*8 bytes from the buffer, plus the result coming > > from the first chain, also 4+1 additions. > > Good call. 
With that approach I also see an improvement for the 32 byte > case (which is on the hot path) so this change might actually matter :) > > > > > asm("movq 0*8(%[src]),%[res_tmp]\n\t" > > "addq 1*8(%[src]),%[res_tmp]\n\t" > > "adcq 2*8(%[src]),%[res_tmp]\n\t" > > "adcq 3*8(%[src]),%[res_tmp]\n\t" > > "adcq 4*8(%[src]),%[res_tmp]\n\t" > > "adcq $0,%[res_tmp]\n\t" > > "addq 5*8(%[src]),%[res]\n\t" > > "adcq 6*8(%[src]),%[res]\n\t" > > "adcq 7*8(%[src]),%[res]\n\t" > > "adcq %[res_tmp],%[res]\n\t" > > "adcq $0,%[res]" > > : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum) > > : [src] "r" (buff) > > : "memory"); > > > > > > > > > > Make the memory clobbers more precise. 'buff' is read only and we know > > > the exact usage range. There is no reason to write-clobber all memory. > > > > Not sure if that matters in this function ? Or do we expect it being inlined ? > > It may matter for LTO build. I also think it can matter for the loop > case. I didn't see > any difference when playing around with the function in userland with: > > ``` > gcc -O3 -march=native -mtune=native checksum.c -o checksum > ``` > > but IIRC if the clobber is loops with inline assembly payloads can be > de-optimized if GCC can't prove the iterations don't affect each other. > > > > > > Personally, I find the "memory" constraint to be more readable than these casts > > "m"(*(const char(*)[64])buff)); > > > > Hmm, I personally find it more readable if I can tell what memory > transforms happen > just from reading the clobbers, but you're the maintainer. I am not x86 maintainer at all, I was only giving an opinion. > > Do you want it changed in V2? Let's hear from x86 maintainers :) > > > > > > > Relative performance changes on Tigerlake: > > > > > > Time Unit: Ref Cycles > > > Size Unit: Bytes > > > > > > size, lat old, lat new, tput old, tput new > > > 0, 4.972, 5.054, 4.864, 4.870 > > > > Really what matters in modern networking is the case for 40 bytes, and > > eventually 8 bytes. 
> > > > Can you add these two cases in this nice table ? > > > > Sure, with your suggestion in the 32 byte cases there is an improvement there > too. > > > We hardly have to checksum anything with NIC that are not decades old. > > > > Apparently making the 64byte loop slightly longer incentives gcc to > > move it away (our intent with the unlikely() hint). > > > > Anyway I am thinking of providing a specialized inline version for > > IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the > > time), > > so we will likely not use csum_partial() anymore. > > I see. For now is it worth adding a case for 40 in this implementation? > > if(likely(len == 40)) { > // optimized 40 + buff aligned case > } > else { > // existing code > } This is what I had first in my tree to demonstrate what gains could be achieved. But decided to keep csum_partial() as a generic function, yet optimize it a bit. My plan is to have a new inline helper, that could be a mere csum_partial() on arches that do not have an optimized version. Preliminary patch to show the idea (this only compiles on x86, we have to add the generic definition) First call point is in GRO stack. Note some of us in the US are not working until next Monday. 
diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index 407beebadaf45a748f91a36b78bd1d023449b132..af93fb53b480ab7102db71c32ab6ca9604c6b5fb 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -182,4 +182,25 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
 				 (__force unsigned)addend);
 }

+static inline __wsum ipv6_csum_partial(const void *buff, int len, __wsum sum)
+{
+	u64 temp64;
+
+	if (unlikely(len != 40))
+		return csum_partial(buff, len, sum);
+
+	temp64 = (__force u64)sum;
+	asm("addq 0*8(%[src]),%[res]\n\t"
+	    "adcq 1*8(%[src]),%[res]\n\t"
+	    "adcq 2*8(%[src]),%[res]\n\t"
+	    "adcq 3*8(%[src]),%[res]\n\t"
+	    "adcq 4*8(%[src]),%[res]\n\t"
+	    "adcq $0,%[res]"
+	    : [res] "+r" (temp64)
+	    : [src] "r" (buff)
+	    : "memory");
+	return (__force __wsum)add32_with_carry(temp64 >> 32, temp64);
+}
+#define ipv6_csum_partial ipv6_csum_partial
+
 #endif /* _ASM_X86_CHECKSUM_64_H */
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 1b9827ff8ccf48e61e233e39d671aa67c8fff0ab..50d1d0c92bdce37ac2f48ded8edec02e1ba9096d 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -274,7 +274,9 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
 	NAPI_GRO_CB(skb)->is_atomic = true;
 	NAPI_GRO_CB(skb)->flush |= flush;

-	skb_gro_postpull_rcsum(skb, iph, nlen);
+	if (NAPI_GRO_CB(skb)->csum_valid)
+		NAPI_GRO_CB(skb)->csum = ~ipv6_csum_partial(iph, nlen,
+							    ~NAPI_GRO_CB(skb)->csum);

 	pp = indirect_call_gro_receive_l4(tcp6_gro_receive, udp6_gro_receive,
 					  ops->callbacks.gro_receive, head, skb);

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 16:08 ` Eric Dumazet @ 2021-11-26 18:17 ` Noah Goldstein 2021-11-26 18:27 ` Eric Dumazet 2021-11-26 18:17 ` Eric Dumazet 1 sibling, 1 reply; 30+ messages in thread From: Noah Goldstein @ 2021-11-26 18:17 UTC (permalink / raw) To: Eric Dumazet Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Fri, Nov 26, 2021 at 10:09 AM Eric Dumazet <edumazet@google.com> wrote: > > On Thu, Nov 25, 2021 at 6:15 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > On Thu, Nov 25, 2021 at 7:50 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > > > Modify the 8x loop to that it uses two independent > > > > accumulators. Despite adding more instructions the latency and > > > > throughput of the loop is improved because the `adc` chains can now > > > > take advantage of multiple execution units. > > > > > > Nice ! > > > > > > Note that I get better results if I do a different split, because the > > > second chain gets shorter. > > > > > > First chain adds 5*8 bytes from the buffer, but first bytes are a mere > > > load, so that is really 4+1 additions. > > > > > > Second chain adds 3*8 bytes from the buffer, plus the result coming > > > from the first chain, also 4+1 additions. > > > > Good call. 
With that approach I also see an improvement for the 32 byte > > case (which is on the hot path) so this change might actually matter :) > > > > > > > > asm("movq 0*8(%[src]),%[res_tmp]\n\t" > > > "addq 1*8(%[src]),%[res_tmp]\n\t" > > > "adcq 2*8(%[src]),%[res_tmp]\n\t" > > > "adcq 3*8(%[src]),%[res_tmp]\n\t" > > > "adcq 4*8(%[src]),%[res_tmp]\n\t" > > > "adcq $0,%[res_tmp]\n\t" > > > "addq 5*8(%[src]),%[res]\n\t" > > > "adcq 6*8(%[src]),%[res]\n\t" > > > "adcq 7*8(%[src]),%[res]\n\t" > > > "adcq %[res_tmp],%[res]\n\t" > > > "adcq $0,%[res]" > > > : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum) > > > : [src] "r" (buff) > > > : "memory"); > > > > > > > > > > > > > > Make the memory clobbers more precise. 'buff' is read only and we know > > > > the exact usage range. There is no reason to write-clobber all memory. > > > > > > Not sure if that matters in this function ? Or do we expect it being inlined ? > > > > It may matter for LTO build. I also think it can matter for the loop > > case. I didn't see > > any difference when playing around with the function in userland with: > > > > ``` > > gcc -O3 -march=native -mtune=native checksum.c -o checksum > > ``` > > > > but IIRC if the clobber is loops with inline assembly payloads can be > > de-optimized if GCC can't prove the iterations don't affect each other. > > > > > > > > > > Personally, I find the "memory" constraint to be more readable than these casts > > > "m"(*(const char(*)[64])buff)); > > > > > > > Hmm, I personally find it more readable if I can tell what memory > > transforms happen > > just from reading the clobbers, but you're the maintainer. > > I am not x86 maintainer at all, I was only giving an opinion. > > > > > Do you want it changed in V2? 
> > Let's hear from x86 maintainers :) > > > > > > > > > > > Relative performance changes on Tigerlake: > > > > > > > > Time Unit: Ref Cycles > > > > Size Unit: Bytes > > > > > > > > size, lat old, lat new, tput old, tput new > > > > 0, 4.972, 5.054, 4.864, 4.870 > > > > > > Really what matters in modern networking is the case for 40 bytes, and > > > eventually 8 bytes. > > > > > > Can you add these two cases in this nice table ? > > > > > > > Sure, with your suggestion in the 32 byte cases there is an improvement there > > too. > > > > > We hardly have to checksum anything with NIC that are not decades old. > > > > > > Apparently making the 64byte loop slightly longer incentives gcc to > > > move it away (our intent with the unlikely() hint). > > > > > > Anyway I am thinking of providing a specialized inline version for > > > IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the > > > time), > > > so we will likely not use csum_partial() anymore. > > > > I see. For now is it worth adding a case for 40 in this implementation? > > > > if(likely(len == 40)) { > > // optimized 40 + buff aligned case > > } > > else { > > // existing code > > } > > This is what I had first in my tree to demonstrate what gains could be achieved. > > But decided to keep csum_partial() as a generic function, yet optimize it a bit. > > My plan is to have a new inline helper, that could be a mere > csum_partial() on arches that do not have > an optimized version. > > Preliminary patch to show the idea (this only compiles on x86, we have > to add the generic definition) > First call point is in GRO stack. > Note some of us in the US are not working until next Monday. 
>
> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
> index 407beebadaf45a748f91a36b78bd1d023449b132..af93fb53b480ab7102db71c32ab6ca9604c6b5fb 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -182,4 +182,25 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
>  				 (__force unsigned)addend);
>  }
>
> +static inline __wsum ipv6_csum_partial(const void *buff, int len, __wsum sum)
> +{
> +	u64 temp64;
> +
> +	if (unlikely(len != 40))
> +		return csum_partial(buff, len, sum);
> +
> +	temp64 = (__force u64)sum;
> +	asm("addq 0*8(%[src]),%[res]\n\t"
> +	    "adcq 1*8(%[src]),%[res]\n\t"
> +	    "adcq 2*8(%[src]),%[res]\n\t"
> +	    "adcq 3*8(%[src]),%[res]\n\t"
> +	    "adcq 4*8(%[src]),%[res]\n\t"
> +	    "adcq $0,%[res]"
> +	    : [res] "+r" (temp64)
> +	    : [src] "r" (buff)
> +	    : "memory");
> +	return (__force __wsum)add32_with_carry(temp64 >> 32, temp64);

Makes sense. Although if you inline I think you definitely will want a more
conservative clobber than just "memory". Also I think with 40 bytes you will
also get some value from two counters.

Did you see the number/question I posted about two accumulators for the
32 byte case?
It's a judgement call about latency vs throughput that I don't really
have an answer for.
> +} > +#define ipv6_csum_partial ipv6_csum_partial > + > #endif /* _ASM_X86_CHECKSUM_64_H */ > diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c > index 1b9827ff8ccf48e61e233e39d671aa67c8fff0ab..50d1d0c92bdce37ac2f48ded8edec02e1ba9096d > 100644 > --- a/net/ipv6/ip6_offload.c > +++ b/net/ipv6/ip6_offload.c > @@ -274,7 +274,9 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff > *ipv6_gro_receive(struct list_head *head, > NAPI_GRO_CB(skb)->is_atomic = true; > NAPI_GRO_CB(skb)->flush |= flush; > > - skb_gro_postpull_rcsum(skb, iph, nlen); > + if (NAPI_GRO_CB(skb)->csum_valid) > + NAPI_GRO_CB(skb)->csum = ~ipv6_csum_partial(iph, nlen, > + ~NAPI_GRO_CB(skb)->csum); > > pp = indirect_call_gro_receive_l4(tcp6_gro_receive, udp6_gro_receive, > ops->callbacks.gro_receive, head, skb); ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
  2021-11-26 18:17 ` Noah Goldstein
@ 2021-11-26 18:27 ` Eric Dumazet
  2021-11-26 18:50 ` Noah Goldstein
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2021-11-26 18:27 UTC (permalink / raw)
To: Noah Goldstein
Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list

On Fri, Nov 26, 2021 at 10:17 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
>
> Makes sense. Although if you inline I think you definitely will want a more
> conservative clobber than just "memory". Also I think with 40 you also will
> get some value from two counters.
>
> Did you see the number/question I posted about two accumulators for 32
> byte case?
> Its a judgement call about latency vs throughput that I don't really have an
> answer for.
>

The thing I do not know is if using more units would slow down the
hyper thread ?

Would using ADCX/ADOX be better in this respect ?

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 18:27 ` Eric Dumazet @ 2021-11-26 18:50 ` Noah Goldstein 2021-11-26 19:14 ` Noah Goldstein 0 siblings, 1 reply; 30+ messages in thread From: Noah Goldstein @ 2021-11-26 18:50 UTC (permalink / raw) To: Eric Dumazet Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Fri, Nov 26, 2021 at 12:27 PM Eric Dumazet <edumazet@google.com> wrote: > > On Fri, Nov 26, 2021 at 10:17 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > > Makes sense. Although if you inline I think you definitely will want a more > > conservative clobber than just "memory". Also I think with 40 you also will > > get some value from two counters. > > > > Did you see the number/question I posted about two accumulators for 32 > > byte case? > > Its a judgement call about latency vs throughput that I don't really have an > > answer for. > > > > The thing I do not know is if using more units would slow down the > hyper thread ? There are more uops in the two accumulator version so it could be concern iff the other hyperthread is bottlenecked on p06 throughput. My general understanding is this is not the common case and that the very premise of hyperthreads is that most bottlenecks are related to memory fetch or resolving control flow. > > Would using ADCX/ADOX would be better in this respect ? What would code using those instructions look like? Having trouble seeing how to use them here. ^ permalink raw reply [flat|nested] 30+ messages in thread
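ADCX and ADOX add with carry through the carry flag (CF) and the overflow flag (OF) respectively, so two add chains can interleave in one instruction stream without contending for a single flag. A portable C model of that dual-chain shape (a hypothetical sketch for illustration only: the helper names, the even/odd word split, and the final merge are assumptions, not code from this thread):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One add-with-carry step: returns acc + val + *flag and leaves the
 * carry-out in *flag.  adcx keeps its chain in CF and adox in OF, so
 * two instances of this step can interleave without saving flags. */
static uint64_t adc_step(uint64_t acc, uint64_t val, unsigned *flag)
{
	uint64_t t = acc + val;
	unsigned c = t < acc;		/* carry out of acc + val */
	uint64_t r = t + *flag;

	*flag = c | (r < t);		/* carry out of the 65-bit sum */
	return r;
}

/* End-around fold of leftover carries, like a trailing "adcq $0". */
static uint64_t fold_carry(uint64_t acc, uint64_t extra)
{
	uint64_t s = acc + extra;

	return s + (s < acc);
}

/* Interleaved reduction of eight words, the shape an adcx/adox loop
 * might take: even words feed the CF chain, odd words the OF chain. */
static uint64_t csum8_two_flags(const uint64_t w[8])
{
	uint64_t cf_acc = 0, of_acc = 0;
	unsigned cf = 0, of = 0;
	size_t i;

	for (i = 0; i < 8; i += 2) {
		cf_acc = adc_step(cf_acc, w[i], &cf);		/* adcx */
		of_acc = adc_step(of_acc, w[i + 1], &of);	/* adox */
	}
	/* merge the two chains, then fold the leftover carries in */
	cf_acc = adc_step(cf_acc, of_acc, &cf);
	return fold_carry(cf_acc, (uint64_t)cf + of);
}

/* Reference: one serial adc chain over the same eight words. */
static uint64_t csum8_one_flag(const uint64_t w[8])
{
	uint64_t acc = 0;
	unsigned cf = 0;
	size_t i;

	for (i = 0; i < 8; i++)
		acc = adc_step(acc, w[i], &cf);
	return fold_carry(acc, cf);
}
```

Both reductions are congruent modulo 2^64 - 1, which is all the one's-complement checksum cares about; any win would come from the two chains retiring on separate execution ports, which is exactly the latency-versus-hyperthread-pressure trade-off discussed above.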
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 18:50 ` Noah Goldstein @ 2021-11-26 19:14 ` Noah Goldstein 2021-11-26 19:21 ` Eric Dumazet 0 siblings, 1 reply; 30+ messages in thread From: Noah Goldstein @ 2021-11-26 19:14 UTC (permalink / raw) To: Eric Dumazet Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Fri, Nov 26, 2021 at 12:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Fri, Nov 26, 2021 at 12:27 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Fri, Nov 26, 2021 at 10:17 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > > > > > > Makes sense. Although if you inline I think you definitely will want a more > > > conservative clobber than just "memory". Also I think with 40 you also will > > > get some value from two counters. > > > > > > Did you see the number/question I posted about two accumulators for 32 > > > byte case? > > > Its a judgement call about latency vs throughput that I don't really have an > > > answer for. > > > > > > > The thing I do not know is if using more units would slow down the > > hyper thread ? Did some quick tests with the latency/throughput benchmarks running in parallel on two hyperthreads on the same processors. The 32 byte case latency advantage goes with 2 accum and there is still a slight regression in throughput. The larger cases that hit the loop still still have improvements both in tput and latency with 2 accum. > > There are more uops in the two accumulator version so it could be concern > iff the other hyperthread is bottlenecked on p06 throughput. My general > understanding is this is not the common case and that the very premise of > hyperthreads is that most bottlenecks are related to memory fetch or resolving > control flow. > > > > > Would using ADCX/ADOX would be better in this respect ? > > What would code using those instructions look like? Having trouble > seeing how to use them here. 

* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 19:14 ` Noah Goldstein @ 2021-11-26 19:21 ` Eric Dumazet 2021-11-26 19:50 ` Noah Goldstein 0 siblings, 1 reply; 30+ messages in thread From: Eric Dumazet @ 2021-11-26 19:21 UTC (permalink / raw) To: Noah Goldstein Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Fri, Nov 26, 2021 at 11:14 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Fri, Nov 26, 2021 at 12:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > On Fri, Nov 26, 2021 at 12:27 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Fri, Nov 26, 2021 at 10:17 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > > > > > > > > > > Makes sense. Although if you inline I think you definitely will want a more > > > > conservative clobber than just "memory". Also I think with 40 you also will > > > > get some value from two counters. > > > > > > > > Did you see the number/question I posted about two accumulators for 32 > > > > byte case? > > > > Its a judgement call about latency vs throughput that I don't really have an > > > > answer for. > > > > > > > > > > The thing I do not know is if using more units would slow down the > > > hyper thread ? > > Did some quick tests with the latency/throughput benchmarks running > in parallel on two hyperthreads on the same processors. The 32 byte case > latency advantage goes with 2 accum and there is still a slight regression > in throughput. The larger cases that hit the loop still still have improvements > both in tput and latency with 2 accum. Great. I also played with rorx instruction, because it removes one "adcl $0,..." 
step __wsum ipv6_csum_partial(const void *buff, int len, __wsum sum) { u64 temp64; u64 tmp; u32 res32; if (unlikely(len != 40)) return csum_partial(buff, len, sum); temp64 = (__force u64)sum; asm("addq 0*8(%[src]),%[temp64]\n\t" "adcq 1*8(%[src]),%[temp64]\n\t" "adcq 2*8(%[src]),%[temp64]\n\t" "adcq 3*8(%[src]),%[temp64]\n\t" "adcq 4*8(%[src]),%[temp64]\n\t" "mov %k[temp64],%[res32]\n\t" "rorx $32,%[temp64],%[temp64]\n\t" "adcl %k[temp64],%[res32]\n\t" "adcl $0,%[res32]" : [temp64] "+r" (temp64), [res32] "=r" (res32) : [src] "r" (buff) : "memory"); return (__force __wsum)res32; } ^ permalink raw reply [flat|nested] 30+ messages in thread
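The reason rorx helps here: unlike ror, it does not write the flags, so the carry left pending by the last adcq flows straight into the following adcl. A portable C model of just the closing 64-to-32 fold (an illustrative sketch with assumed names, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 64-bit accumulator plus a pending carry down to 32 bits.
 * In the asm above, "rorx $32" swaps the halves without clobbering
 * CF, so the adcl can still consume the carry from the adcq chain. */
static uint32_t fold_to_32(uint64_t acc, unsigned pending_carry)
{
	uint64_t t = (acc & 0xffffffffULL) + (acc >> 32) + pending_carry;

	while (t >> 32)			/* end-around carry */
		t = (t & 0xffffffffULL) + (t >> 32);
	return (uint32_t)t;
}
```

The loop body runs at most twice: the first fold can itself produce a one-bit carry (e.g. 0xffffffff + 0xffffffff + 1), and the second pass always terminates.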
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 19:21 ` Eric Dumazet @ 2021-11-26 19:50 ` Noah Goldstein 2021-11-26 20:07 ` Eric Dumazet 0 siblings, 1 reply; 30+ messages in thread From: Noah Goldstein @ 2021-11-26 19:50 UTC (permalink / raw) To: Eric Dumazet Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Fri, Nov 26, 2021 at 1:21 PM Eric Dumazet <edumazet@google.com> wrote: > > On Fri, Nov 26, 2021 at 11:14 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > On Fri, Nov 26, 2021 at 12:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > On Fri, Nov 26, 2021 at 12:27 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Fri, Nov 26, 2021 at 10:17 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > Makes sense. Although if you inline I think you definitely will want a more > > > > > conservative clobber than just "memory". Also I think with 40 you also will > > > > > get some value from two counters. > > > > > > > > > > Did you see the number/question I posted about two accumulators for 32 > > > > > byte case? > > > > > Its a judgement call about latency vs throughput that I don't really have an > > > > > answer for. > > > > > > > > > > > > > The thing I do not know is if using more units would slow down the > > > > hyper thread ? > > > > Did some quick tests with the latency/throughput benchmarks running > > in parallel on two hyperthreads on the same processors. The 32 byte case > > latency advantage goes with 2 accum and there is still a slight regression > > in throughput. The larger cases that hit the loop still still have improvements > > both in tput and latency with 2 accum. > > Great. I also played with rorx instruction, because it removes one > "adcl $0,..." step Bright :) but it will need a BMI support check. 
> > __wsum ipv6_csum_partial(const void *buff, int len, __wsum sum) > { > u64 temp64; > u64 tmp; > u32 res32; > > if (unlikely(len != 40)) > return csum_partial(buff, len, sum); > > temp64 = (__force u64)sum; > asm("addq 0*8(%[src]),%[temp64]\n\t" > "adcq 1*8(%[src]),%[temp64]\n\t" > "adcq 2*8(%[src]),%[temp64]\n\t" > "adcq 3*8(%[src]),%[temp64]\n\t" > "adcq 4*8(%[src]),%[temp64]\n\t" > "mov %k[temp64],%[res32]\n\t" > "rorx $32,%[temp64],%[temp64]\n\t" > "adcl %k[temp64],%[res32]\n\t" > "adcl $0,%[res32]" > : [temp64] "+r" (temp64), [res32] "=r" (res32) > : [src] "r" (buff) > : "memory"); > return (__force __wsum)res32; > } I actually get better performance in hyperthread benchmarks with 2 accum: Used: u64 res; temp64 = (__force uint64_t)sum; asm("movq 0*8(%[src]),%[res]\n\t" "addq 1*8(%[src]),%[res]\n\t" "adcq 2*8(%[src]),%[res]\n\t" "adcq $0, %[res]\n" "addq 3*8(%[src]),%[temp64]\n\t" "adcq 4*8(%[src]),%[temp64]\n\t" "adcq %[res], %[temp64]\n\t" "mov %k[temp64],%k[res]\n\t" "rorx $32,%[temp64],%[temp64]\n\t" "adcl %k[temp64],%k[res]\n\t" "adcl $0,%k[res]" : [temp64] "+r"(temp64), [res] "=&r"(res) : [src] "r"(buff) : "memory"); return (__force __wsum)res; w/ hyperthread: size, 2acc lat, 1acc lat, 2acc tput, 1acc tput 40, 6.511, 7.863, 6.177, 6.157 w/o hyperthread: size, 2acc lat, 1acc lat, 2acc tput, 1acc tput 40, 5.577, 6.764, 3.150, 3.210 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 19:50 ` Noah Goldstein @ 2021-11-26 20:07 ` Eric Dumazet 2021-11-26 20:33 ` Noah Goldstein 0 siblings, 1 reply; 30+ messages in thread From: Eric Dumazet @ 2021-11-26 20:07 UTC (permalink / raw) To: Noah Goldstein Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Fri, Nov 26, 2021 at 11:50 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > Bright :) but it will need a BMI support check. Yes, probably not worth the pain. > > I actually get better performance in hyperthread benchmarks with 2 accum: > > Used: > > u64 res; > temp64 = (__force uint64_t)sum; > asm("movq 0*8(%[src]),%[res]\n\t" > "addq 1*8(%[src]),%[res]\n\t" > "adcq 2*8(%[src]),%[res]\n\t" > "adcq $0, %[res]\n" > "addq 3*8(%[src]),%[temp64]\n\t" > "adcq 4*8(%[src]),%[temp64]\n\t" > "adcq %[res], %[temp64]\n\t" > "mov %k[temp64],%k[res]\n\t" > "rorx $32,%[temp64],%[temp64]\n\t" > "adcl %k[temp64],%k[res]\n\t" > "adcl $0,%k[res]" > : [temp64] "+r"(temp64), [res] "=&r"(res) > : [src] "r"(buff) > : "memory"); > return (__force __wsum)res; > > w/ hyperthread: > size, 2acc lat, 1acc lat, 2acc tput, 1acc tput > 40, 6.511, 7.863, 6.177, 6.157 > > w/o hyperthread: > size, 2acc lat, 1acc lat, 2acc tput, 1acc tput > 40, 5.577, 6.764, 3.150, 3.210 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 20:07 ` Eric Dumazet @ 2021-11-26 20:33 ` Noah Goldstein 2021-11-27 0:15 ` Eric Dumazet 0 siblings, 1 reply; 30+ messages in thread From: Noah Goldstein @ 2021-11-26 20:33 UTC (permalink / raw) To: Eric Dumazet Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Fri, Nov 26, 2021 at 2:07 PM Eric Dumazet <edumazet@google.com> wrote: > > On Fri, Nov 26, 2021 at 11:50 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > Bright :) but it will need a BMI support check. > > Yes, probably not worth the pain. Making a V2 for my patch with your optimization for the loop case. Do you think 1 or 2 accum for the 32 byte case? > > > > > I actually get better performance in hyperthread benchmarks with 2 accum: > > > > Used: > > > > u64 res; > > temp64 = (__force uint64_t)sum; > > asm("movq 0*8(%[src]),%[res]\n\t" > > "addq 1*8(%[src]),%[res]\n\t" > > "adcq 2*8(%[src]),%[res]\n\t" > > "adcq $0, %[res]\n" > > "addq 3*8(%[src]),%[temp64]\n\t" > > "adcq 4*8(%[src]),%[temp64]\n\t" > > "adcq %[res], %[temp64]\n\t" > > "mov %k[temp64],%k[res]\n\t" > > "rorx $32,%[temp64],%[temp64]\n\t" > > "adcl %k[temp64],%k[res]\n\t" > > "adcl $0,%k[res]" > > : [temp64] "+r"(temp64), [res] "=&r"(res) > > : [src] "r"(buff) > > : "memory"); > > return (__force __wsum)res; > > > > w/ hyperthread: > > size, 2acc lat, 1acc lat, 2acc tput, 1acc tput > > 40, 6.511, 7.863, 6.177, 6.157 > > > > w/o hyperthread: > > size, 2acc lat, 1acc lat, 2acc tput, 1acc tput > > 40, 5.577, 6.764, 3.150, 3.210 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 20:33 ` Noah Goldstein @ 2021-11-27 0:15 ` Eric Dumazet 2021-11-27 0:39 ` Noah Goldstein 0 siblings, 1 reply; 30+ messages in thread From: Eric Dumazet @ 2021-11-27 0:15 UTC (permalink / raw) To: Noah Goldstein Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Fri, Nov 26, 2021 at 12:33 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Fri, Nov 26, 2021 at 2:07 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Fri, Nov 26, 2021 at 11:50 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > Bright :) but it will need a BMI support check. > > > > Yes, probably not worth the pain. > > Making a V2 for my patch with your optimization for the loop case. Do you think > 1 or 2 accum for the 32 byte case? > I would vote for something simpler, thus one accum, since this 32byte block is only run one time ? Thanks ! ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-27 0:15 ` Eric Dumazet @ 2021-11-27 0:39 ` Noah Goldstein 0 siblings, 0 replies; 30+ messages in thread From: Noah Goldstein @ 2021-11-27 0:39 UTC (permalink / raw) To: Eric Dumazet Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Fri, Nov 26, 2021 at 6:15 PM Eric Dumazet <edumazet@google.com> wrote: > > On Fri, Nov 26, 2021 at 12:33 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > On Fri, Nov 26, 2021 at 2:07 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Fri, Nov 26, 2021 at 11:50 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > > > Bright :) but it will need a BMI support check. > > > > > > Yes, probably not worth the pain. > > > > Making a V2 for my patch with your optimization for the loop case. Do you think > > 1 or 2 accum for the 32 byte case? > > > > I would vote for something simpler, thus one accum, since this 32byte > block is only run one time ? If the one at a time performance is whats the most important wouldn't that argue in favor of 2x accum because it lead to decreased latency? Or are you saying it's not that important so simpler codes is the priority? > > Thanks ! ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-26 16:08 ` Eric Dumazet 2021-11-26 18:17 ` Noah Goldstein @ 2021-11-26 18:17 ` Eric Dumazet 1 sibling, 0 replies; 30+ messages in thread From: Eric Dumazet @ 2021-11-26 18:17 UTC (permalink / raw) To: Noah Goldstein Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Fri, Nov 26, 2021 at 8:08 AM Eric Dumazet <edumazet@google.com> wrote: > diff --git a/arch/x86/include/asm/checksum_64.h > b/arch/x86/include/asm/checksum_64.h > index 407beebadaf45a748f91a36b78bd1d023449b132..af93fb53b480ab7102db71c32ab6ca9604c6b5fb > 100644 > --- a/arch/x86/include/asm/checksum_64.h > +++ b/arch/x86/include/asm/checksum_64.h > @@ -182,4 +182,26 @@ static inline __wsum csum_add(__wsum csum, __wsum addend) > (__force unsigned)addend); > } > > +static inline __wsum ipv6_csum_partial(const void *buff, int len, __wsum sum) > +{ > + u64 temp64; > + > + if (unlikely(len == 40)) this of course needs to be the opposite condition if (unlikely(len != sizeof(struct ipv6hdr)) > + return csum_partial(buff, len, sum); > + > + temp64 = (__force u64)sum; > + asm("addq 0*8(%[src]),%[res]\n\t" > + "adcq 1*8(%[src]),%[res]\n\t" > + "adcq 2*8(%[src]),%[res]\n\t" > + "adcq 3*8(%[src]),%[res]\n\t" > + "adcq 4*8(%[src]),%[res]\n\t" > + "adcq 5*8(%[src]),%[res]\n\t" > + "adcq $0,%[res]" > + : [res] "+r" (temp64) > + : [src] "r" (buff) > + : "memory"); > + return (__force __wsum)add32_with_carry(temp64 >> 32, temp64); > +} ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v2] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-25 19:38 [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c Noah Goldstein 2021-11-26 1:50 ` Eric Dumazet @ 2021-11-27 4:25 ` Noah Goldstein 2021-11-27 6:03 ` Eric Dumazet 2021-11-27 6:39 ` [PATCH v3] " Noah Goldstein 2 siblings, 1 reply; 30+ messages in thread From: Noah Goldstein @ 2021-11-27 4:25 UTC (permalink / raw) Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, alexanderduyck, goldstein.w.n, edumazet, linux-kernel Modify the 8x loop to that it uses two independent accumulators. Despite adding more instructions the latency and throughput of the loop is improved because the `adc` chains can now take advantage of multiple execution units. Make the memory clobbers more precise. 'buff' is read only and we know the exact usage range. There is no reason to write-clobber all memory. Relative performance changes on Tigerlake: Time Unit: Ref Cycles Size Unit: Bytes size, lat old, lat new, tput old, tput new 0, 4.961, 4.901, 4.887, 4.951 8, 5.590, 5.620, 4.227, 4.252 16, 6.182, 6.202, 4.233, 4.278 24, 7.392, 7.380, 4.256, 4.279 32, 7.371, 7.390, 4.550, 4.537 40, 8.621, 8.601, 4.862, 4.836 48, 9.406, 9.374, 5.206, 5.234 56, 10.535, 10.522, 5.416, 5.447 64, 10.000, 7.590, 6.946, 6.989 100, 14.218, 12.476, 9.429, 9.441 200, 22.115, 16.937, 13.088, 12.852 300, 31.826, 24.640, 19.383, 18.230 400, 39.016, 28.133, 23.223, 21.304 500, 48.815, 36.186, 30.331, 27.104 600, 56.732, 40.120, 35.899, 30.363 700, 66.623, 48.178, 43.044, 36.400 800, 73.259, 51.171, 48.564, 39.173 900, 82.821, 56.635, 58.592, 45.162 1000, 90.780, 63.703, 65.658, 48.718 Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> tmp --- arch/x86/lib/csum-partial_64.c | 38 +++++++++++++++++----------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c index ded842cd1020..52540f148ebb 100644 --- 
a/arch/x86/lib/csum-partial_64.c +++ b/arch/x86/lib/csum-partial_64.c @@ -48,18 +48,21 @@ __wsum csum_partial(const void *buff, int len, __wsum sum) } while (unlikely(len >= 64)) { - asm("addq 0*8(%[src]),%[res]\n\t" - "adcq 1*8(%[src]),%[res]\n\t" - "adcq 2*8(%[src]),%[res]\n\t" - "adcq 3*8(%[src]),%[res]\n\t" - "adcq 4*8(%[src]),%[res]\n\t" - "adcq 5*8(%[src]),%[res]\n\t" + u64 temp_accum; + + asm("movq 0*8(%[src]),%[res_tmp]\n\t" + "addq 1*8(%[src]),%[res_tmp]\n\t" + "adcq 2*8(%[src]),%[res_tmp]\n\t" + "adcq 3*8(%[src]),%[res_tmp]\n\t" + "adcq 4*8(%[src]),%[res_tmp]\n\t" + "adcq $0,%[res_tmp]\n\t" + "addq 5*8(%[src]),%[res]\n\t" "adcq 6*8(%[src]),%[res]\n\t" "adcq 7*8(%[src]),%[res]\n\t" - "adcq $0,%[res]" - : [res] "+r" (temp64) - : [src] "r" (buff) - : "memory"); + "adcq %[res_tmp], %[res]\n\t" + "adcq $0,%[res]\n\t" + : [res] "+r"(temp64), [res_tmp] "=&r"(temp_accum) + : [src] "r"(buff), "m"(*(const char(*)[64])buff)); buff += 64; len -= 64; } @@ -70,26 +73,23 @@ __wsum csum_partial(const void *buff, int len, __wsum sum) "adcq 2*8(%[src]),%[res]\n\t" "adcq 3*8(%[src]),%[res]\n\t" "adcq $0,%[res]" - : [res] "+r" (temp64) - : [src] "r" (buff) - : "memory"); + : [res] "+r"(temp64) + : [src] "r"(buff), "m"(*(const char(*)[32])buff)); buff += 32; } if (len & 16) { asm("addq 0*8(%[src]),%[res]\n\t" "adcq 1*8(%[src]),%[res]\n\t" "adcq $0,%[res]" - : [res] "+r" (temp64) - : [src] "r" (buff) - : "memory"); + : [res] "+r"(temp64) + : [src] "r"(buff), "m"(*(const char(*)[16])buff)); buff += 16; } if (len & 8) { asm("addq 0*8(%[src]),%[res]\n\t" "adcq $0,%[res]" - : [res] "+r" (temp64) - : [src] "r" (buff) - : "memory"); + : [res] "+r"(temp64) + : [src] "r"(buff), "m"(*(const char(*)[8])buff)); buff += 8; } if (len & 7) { -- 2.25.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
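The equivalence of the two orderings can be sanity-checked in portable C. The sketch below (hypothetical helper names; add64_ones folds each carry back in immediately, which is congruent to the deferred-carry asm in one's-complement arithmetic) models one 64-byte iteration both ways:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64-bit one's-complement add: fold the carry-out straight back in. */
static uint64_t add64_ones(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < a);
}

/* Model of one iteration of the patched loop: words 0-4 run through an
 * independent accumulator ("movq/addq/adcq ... adcq $0"), words 5-7
 * stay on the running sum, and the two chains merge at the end. */
static uint64_t csum64_two_chains(const unsigned char *buff, uint64_t sum)
{
	uint64_t w[8], tmp;
	int i;

	memcpy(w, buff, sizeof(w));	/* alignment-safe loads */

	tmp = w[0];
	for (i = 1; i < 5; i++)
		tmp = add64_ones(tmp, w[i]);

	for (i = 5; i < 8; i++)
		sum = add64_ones(sum, w[i]);

	return add64_ones(sum, tmp);
}

/* Model of the original single adc chain over the same 64 bytes. */
static uint64_t csum64_one_chain(const unsigned char *buff, uint64_t sum)
{
	uint64_t w[8];
	int i;

	memcpy(w, buff, sizeof(w));
	for (i = 0; i < 8; i++)
		sum = add64_ones(sum, w[i]);
	return sum;
}
```

One's-complement addition is associative and commutative, so splitting the chain only changes the dependency graph the CPU sees, not the resulting checksum.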
* Re: [PATCH v2] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-27 4:25 ` [PATCH v2] " Noah Goldstein @ 2021-11-27 6:03 ` Eric Dumazet 2021-11-27 6:38 ` Noah Goldstein 0 siblings, 1 reply; 30+ messages in thread From: Eric Dumazet @ 2021-11-27 6:03 UTC (permalink / raw) To: Noah Goldstein Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, alexanderduyck, linux-kernel On Fri, Nov 26, 2021 at 8:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > Modify the 8x loop to that it uses two independent > accumulators. Despite adding more instructions the latency and > throughput of the loop is improved because the `adc` chains can now > take advantage of multiple execution units. > > Make the memory clobbers more precise. 'buff' is read only and we know > the exact usage range. There is no reason to write-clobber all memory. > > Relative performance changes on Tigerlake: > > Time Unit: Ref Cycles > Size Unit: Bytes > > size, lat old, lat new, tput old, tput new > 0, 4.961, 4.901, 4.887, 4.951 > 8, 5.590, 5.620, 4.227, 4.252 > 16, 6.182, 6.202, 4.233, 4.278 > 24, 7.392, 7.380, 4.256, 4.279 > 32, 7.371, 7.390, 4.550, 4.537 > 40, 8.621, 8.601, 4.862, 4.836 > 48, 9.406, 9.374, 5.206, 5.234 > 56, 10.535, 10.522, 5.416, 5.447 > 64, 10.000, 7.590, 6.946, 6.989 > 100, 14.218, 12.476, 9.429, 9.441 > 200, 22.115, 16.937, 13.088, 12.852 > 300, 31.826, 24.640, 19.383, 18.230 > 400, 39.016, 28.133, 23.223, 21.304 > 500, 48.815, 36.186, 30.331, 27.104 > 600, 56.732, 40.120, 35.899, 30.363 > 700, 66.623, 48.178, 43.044, 36.400 > 800, 73.259, 51.171, 48.564, 39.173 > 900, 82.821, 56.635, 58.592, 45.162 > 1000, 90.780, 63.703, 65.658, 48.718 > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> > > tmp SGTM (not sure what this 'tmp' string means here :) ) Reviewed-by: Eric Dumazet <edumazet@google.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-27 6:03 ` Eric Dumazet @ 2021-11-27 6:38 ` Noah Goldstein 0 siblings, 0 replies; 30+ messages in thread From: Noah Goldstein @ 2021-11-27 6:38 UTC (permalink / raw) To: Eric Dumazet Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Sat, Nov 27, 2021 at 12:03 AM Eric Dumazet <edumazet@google.com> wrote: > > On Fri, Nov 26, 2021 at 8:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > Modify the 8x loop to that it uses two independent > > accumulators. Despite adding more instructions the latency and > > throughput of the loop is improved because the `adc` chains can now > > take advantage of multiple execution units. > > > > Make the memory clobbers more precise. 'buff' is read only and we know > > the exact usage range. There is no reason to write-clobber all memory. > > > > Relative performance changes on Tigerlake: > > > > Time Unit: Ref Cycles > > Size Unit: Bytes > > > > size, lat old, lat new, tput old, tput new > > 0, 4.961, 4.901, 4.887, 4.951 > > 8, 5.590, 5.620, 4.227, 4.252 > > 16, 6.182, 6.202, 4.233, 4.278 > > 24, 7.392, 7.380, 4.256, 4.279 > > 32, 7.371, 7.390, 4.550, 4.537 > > 40, 8.621, 8.601, 4.862, 4.836 > > 48, 9.406, 9.374, 5.206, 5.234 > > 56, 10.535, 10.522, 5.416, 5.447 > > 64, 10.000, 7.590, 6.946, 6.989 > > 100, 14.218, 12.476, 9.429, 9.441 > > 200, 22.115, 16.937, 13.088, 12.852 > > 300, 31.826, 24.640, 19.383, 18.230 > > 400, 39.016, 28.133, 23.223, 21.304 > > 500, 48.815, 36.186, 30.331, 27.104 > > 600, 56.732, 40.120, 35.899, 30.363 > > 700, 66.623, 48.178, 43.044, 36.400 > > 800, 73.259, 51.171, 48.564, 39.173 > > 900, 82.821, 56.635, 58.592, 45.162 > > 1000, 90.780, 63.703, 65.658, 48.718 > > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> > > > > tmp > > SGTM (not sure what this 'tmp' string means here :) ) > > Reviewed-by: Eric Dumazet <edumazet@google.com> Poor rebasing 
practices :/ Fixed in V3 (only change). ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v3] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-25 19:38 [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c Noah Goldstein 2021-11-26 1:50 ` Eric Dumazet 2021-11-27 4:25 ` [PATCH v2] " Noah Goldstein @ 2021-11-27 6:39 ` Noah Goldstein 2021-11-27 6:51 ` Eric Dumazet 2 siblings, 1 reply; 30+ messages in thread From: Noah Goldstein @ 2021-11-27 6:39 UTC (permalink / raw) Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, alexanderduyck, goldstein.w.n, edumazet, linux-kernel Modify the 8x loop to that it uses two independent accumulators. Despite adding more instructions the latency and throughput of the loop is improved because the `adc` chains can now take advantage of multiple execution units. Make the memory clobbers more precise. 'buff' is read only and we know the exact usage range. There is no reason to write-clobber all memory. Relative performance changes on Tigerlake: Time Unit: Ref Cycles Size Unit: Bytes size, lat old, lat new, tput old, tput new 0, 4.961, 4.901, 4.887, 4.951 8, 5.590, 5.620, 4.227, 4.252 16, 6.182, 6.202, 4.233, 4.278 24, 7.392, 7.380, 4.256, 4.279 32, 7.371, 7.390, 4.550, 4.537 40, 8.621, 8.601, 4.862, 4.836 48, 9.406, 9.374, 5.206, 5.234 56, 10.535, 10.522, 5.416, 5.447 64, 10.000, 7.590, 6.946, 6.989 100, 14.218, 12.476, 9.429, 9.441 200, 22.115, 16.937, 13.088, 12.852 300, 31.826, 24.640, 19.383, 18.230 400, 39.016, 28.133, 23.223, 21.304 500, 48.815, 36.186, 30.331, 27.104 600, 56.732, 40.120, 35.899, 30.363 700, 66.623, 48.178, 43.044, 36.400 800, 73.259, 51.171, 48.564, 39.173 900, 82.821, 56.635, 58.592, 45.162 1000, 90.780, 63.703, 65.658, 48.718 Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> --- arch/x86/lib/csum-partial_64.c | 38 +++++++++++++++++----------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c index ded842cd1020..52540f148ebb 100644 --- 
a/arch/x86/lib/csum-partial_64.c +++ b/arch/x86/lib/csum-partial_64.c @@ -48,18 +48,21 @@ __wsum csum_partial(const void *buff, int len, __wsum sum) } while (unlikely(len >= 64)) { - asm("addq 0*8(%[src]),%[res]\n\t" - "adcq 1*8(%[src]),%[res]\n\t" - "adcq 2*8(%[src]),%[res]\n\t" - "adcq 3*8(%[src]),%[res]\n\t" - "adcq 4*8(%[src]),%[res]\n\t" - "adcq 5*8(%[src]),%[res]\n\t" + u64 temp_accum; + + asm("movq 0*8(%[src]),%[res_tmp]\n\t" + "addq 1*8(%[src]),%[res_tmp]\n\t" + "adcq 2*8(%[src]),%[res_tmp]\n\t" + "adcq 3*8(%[src]),%[res_tmp]\n\t" + "adcq 4*8(%[src]),%[res_tmp]\n\t" + "adcq $0,%[res_tmp]\n\t" + "addq 5*8(%[src]),%[res]\n\t" "adcq 6*8(%[src]),%[res]\n\t" "adcq 7*8(%[src]),%[res]\n\t" - "adcq $0,%[res]" - : [res] "+r" (temp64) - : [src] "r" (buff) - : "memory"); + "adcq %[res_tmp], %[res]\n\t" + "adcq $0,%[res]\n\t" + : [res] "+r"(temp64), [res_tmp] "=&r"(temp_accum) + : [src] "r"(buff), "m"(*(const char(*)[64])buff)); buff += 64; len -= 64; } @@ -70,26 +73,23 @@ __wsum csum_partial(const void *buff, int len, __wsum sum) "adcq 2*8(%[src]),%[res]\n\t" "adcq 3*8(%[src]),%[res]\n\t" "adcq $0,%[res]" - : [res] "+r" (temp64) - : [src] "r" (buff) - : "memory"); + : [res] "+r"(temp64) + : [src] "r"(buff), "m"(*(const char(*)[32])buff)); buff += 32; } if (len & 16) { asm("addq 0*8(%[src]),%[res]\n\t" "adcq 1*8(%[src]),%[res]\n\t" "adcq $0,%[res]" - : [res] "+r" (temp64) - : [src] "r" (buff) - : "memory"); + : [res] "+r"(temp64) + : [src] "r"(buff), "m"(*(const char(*)[16])buff)); buff += 16; } if (len & 8) { asm("addq 0*8(%[src]),%[res]\n\t" "adcq $0,%[res]" - : [res] "+r" (temp64) - : [src] "r" (buff) - : "memory"); + : [res] "+r"(temp64) + : [src] "r"(buff), "m"(*(const char(*)[8])buff)); buff += 8; } if (len & 7) { -- 2.25.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v3] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-27 6:39 ` [PATCH v3] " Noah Goldstein @ 2021-11-27 6:51 ` Eric Dumazet 2021-11-27 7:18 ` Noah Goldstein 0 siblings, 1 reply; 30+ messages in thread From: Eric Dumazet @ 2021-11-27 6:51 UTC (permalink / raw) To: Noah Goldstein Cc: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, alexanderduyck, linux-kernel On Fri, Nov 26, 2021 at 10:39 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > Modify the 8x loop to that it uses two independent > accumulators. Despite adding more instructions the latency and > throughput of the loop is improved because the `adc` chains can now > take advantage of multiple execution units. Oh well, there was really no need to resend this, especially if you do not add my ack. Reviewed-by: Eric Dumazet <edumazet@google.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v3] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c 2021-11-27 6:51 ` Eric Dumazet @ 2021-11-27 7:18 ` Noah Goldstein 0 siblings, 0 replies; 30+ messages in thread From: Noah Goldstein @ 2021-11-27 7:18 UTC (permalink / raw) To: Eric Dumazet Cc: tglx, mingo, Borislav Petkov, dave.hansen, X86 ML, hpa, peterz, alexanderduyck, open list On Sat, Nov 27, 2021 at 12:51 AM Eric Dumazet <edumazet@google.com> wrote: > > On Fri, Nov 26, 2021 at 10:39 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > Modify the 8x loop to that it uses two independent > > accumulators. Despite adding more instructions the latency and > > throughput of the loop is improved because the `adc` chains can now > > take advantage of multiple execution units. > > Oh well, there was really no need to resend this, especially if you do > not add my ack. > > Reviewed-by: Eric Dumazet <edumazet@google.com> Sorry! Is the protocol for tags to let the maintainer add them or should I post a V4 with your tag? ^ permalink raw reply [flat|nested] 30+ messages in thread