* [PATCH] lib/int_sqrt.c: Optimize square root function @ 2015-02-02 17:12 Anshul Garg 2015-02-02 19:00 ` Linus Torvalds 2015-02-02 19:10 ` Joe Perches 0 siblings, 2 replies; 27+ messages in thread From: Anshul Garg @ 2015-02-02 17:12 UTC (permalink / raw) To: linux-kernel; +Cc: aksgarg1989, anshul.g, torvalds From: Anshul Garg <aksgarg1989@gmail.com> Unnecessary instructions are executing even though m is greater than x so added logic to make m less than equal to x before performing these operations. Signed-off-by: Anshul Garg <aksgarg1989@gmail.com> --- lib/int_sqrt.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c index 1ef4cc3..64ae722 100644 --- a/lib/int_sqrt.c +++ b/lib/int_sqrt.c @@ -22,6 +22,9 @@ unsigned long int_sqrt(unsigned long x) return x; m = 1UL << (BITS_PER_LONG - 2); + + while (m > x) + m >>= 2; while (m != 0) { b = y + m; y >>= 1; -- 1.7.9.5 --- This email has been checked for viruses by Avast antivirus software. http://www.avast.com ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-02 17:12 [PATCH] lib/int_sqrt.c: Optimize square root function Anshul Garg @ 2015-02-02 19:00 ` Linus Torvalds 2015-02-02 19:13 ` Linus Torvalds 2015-02-03 4:47 ` Davidlohr Bueso 2015-02-02 19:10 ` Joe Perches 1 sibling, 2 replies; 27+ messages in thread From: Linus Torvalds @ 2015-02-02 19:00 UTC (permalink / raw) To: Anshul Garg, Davidlohr Bueso; +Cc: Linux Kernel Mailing List, anshul.g Hmm. I don't disagree, but would like some more feedback. Davidlohr - you were the person to touch this function last (commit 30493cc9dddb: "lib/int_sqrt.c: optimize square root algorithm"), and you did so for performance reasons. And in fact, when you did that, you removed that initial loop: - one = 1UL << (BITS_PER_LONG - 2); - while (one > op) - one >>= 2; but I'm not sure that was actually all that conscious, I think the real optimization was the changes inside the loop to make the final real loop faster and simpler. Also, you had performance numbers, so presumably a test harness for it all. It probably depends a lot on the actual distribution of argument values, of course, but it would be good to accompany the patch with actual real numbers like lasty time. (I'm also not entirely sure what uses int_sqrt() that ends up being so performance-critical, so it would be good to document that too, since that probably also matters for the "what's the normal argument range" question..) Linus On Mon, Feb 2, 2015 at 9:12 AM, Anshul Garg <aksgarg1989@gmail.com> wrote: > From: Anshul Garg <aksgarg1989@gmail.com> > > Unnecessary instructions are executing even though m is > greater than x so added logic to make m less than equal to > x before performing these operations. > > Signed-off-by: Anshul Garg <aksgarg1989@gmail.com> > --- > lib/int_sqrt.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c > index 1ef4cc3..64ae722 100644 > --- a/lib/int_sqrt.c > +++ b/lib/int_sqrt.c > @@ -22,6 +22,9 @@ unsigned long int_sqrt(unsigned long x) > return x; > > m = 1UL << (BITS_PER_LONG - 2); > + > + while (m > x) > + m >>= 2; > while (m != 0) { > b = y + m; > y >>= 1; > -- > 1.7.9.5 > > > --- > This email has been checked for viruses by Avast antivirus software. > http://www.avast.com > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-02 19:00 ` Linus Torvalds @ 2015-02-02 19:13 ` Linus Torvalds 2015-02-03 4:41 ` Davidlohr Bueso 2017-07-20 11:24 ` Peter Zijlstra 2015-02-03 4:47 ` Davidlohr Bueso 1 sibling, 2 replies; 27+ messages in thread From: Linus Torvalds @ 2015-02-02 19:13 UTC (permalink / raw) To: Anshul Garg, Davidlohr Bueso; +Cc: Linux Kernel Mailing List, anshul.g On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > (I'm also not entirely sure what uses int_sqrt() that ends up being so > performance-critical, so it would be good to document that too, since > that probably also matters for the "what's the normal argument range" > question..) ... it's also not entirely clear that we need a whole new loop. We might just instead start off with a better guess for 'm' using some calculation that might be doable with a single conditional move instruction instead of a loop. Because I suspect that the inevitable branch misprediction of a new loop is likely as expensive as a few iterations through the core one. IOW, instead of m = 1UL << (BITS_PER_LONG - 2); perhaps something like m = 1UL << (BITS_PER_LONG/2- 2); if (m < x) m <<= BITS_PER_LONG/2; (assuming gcc can change that code into a "cmov") might cut down the "lots of empty loops" case in half for small values of 'x'. There's probably some other better cheap initial guess value estimator. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-02 19:13 ` Linus Torvalds @ 2015-02-03 4:41 ` Davidlohr Bueso 2015-02-03 4:57 ` Davidlohr Bueso 2015-02-03 15:54 ` Anshul Garg 2017-07-20 11:24 ` Peter Zijlstra 1 sibling, 2 replies; 27+ messages in thread From: Davidlohr Bueso @ 2015-02-03 4:41 UTC (permalink / raw) To: Linus Torvalds; +Cc: Anshul Garg, Linux Kernel Mailing List, anshul.g On Mon, 2015-02-02 at 11:13 -0800, Linus Torvalds wrote: > IOW, instead of > > m = 1UL << (BITS_PER_LONG - 2); > > perhaps something like > > m = 1UL << (BITS_PER_LONG/2- 2); > if (m < x) > m <<= BITS_PER_LONG/2; > > (assuming gcc can change that code into a "cmov") might cut down the > "lots of empty loops" case in half for small values of 'x'. Makes a lot of sense. > There's probably some other better cheap initial guess value estimator. Just to get a feeling for the normal arg range in the non-driver parts that use this thing: fs/ceph/super.h: congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10); fs/nfs/write.c: nfs_congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10); fs/nfsd/nfscache.c: limit = (16 * int_sqrt(low_pages)) << (PAGE_SHIFT-10); kernel/rcu/tree_plugin.h: ls = int_sqrt(nr_cpu_ids); mm/memcontrol.c: inactive_ratio = int_sqrt(10 * gb); mm/page_alloc.c: ratio = int_sqrt(10 * gb); mm/page_alloc.c: new_min_free_kbytes = int_sqrt(lowmem_kbytes * 16); So mostly values that scale according to mem or cpu. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-03 4:41 ` Davidlohr Bueso @ 2015-02-03 4:57 ` Davidlohr Bueso 2015-02-03 15:54 ` Anshul Garg 1 sibling, 0 replies; 27+ messages in thread From: Davidlohr Bueso @ 2015-02-03 4:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: Anshul Garg, Linux Kernel Mailing List, anshul.g On Mon, 2015-02-02 at 20:41 -0800, Davidlohr Bueso wrote: > On Mon, 2015-02-02 at 11:13 -0800, Linus Torvalds wrote: > > IOW, instead of > > > > m = 1UL << (BITS_PER_LONG - 2); > > > > perhaps something like > > > > m = 1UL << (BITS_PER_LONG/2- 2); > > if (m < x) > > m <<= BITS_PER_LONG/2; Assuming small values mostly, we could also try more aggressive estimators for BITS_PER_LONG == 64. But then again, it probably doesn't matter. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-03 4:41 ` Davidlohr Bueso 2015-02-03 4:57 ` Davidlohr Bueso @ 2015-02-03 15:54 ` Anshul Garg 1 sibling, 0 replies; 27+ messages in thread From: Anshul Garg @ 2015-02-03 15:54 UTC (permalink / raw) To: Davidlohr Bueso; +Cc: Linus Torvalds, Linux Kernel Mailing List, anshul.g On Tue, Feb 3, 2015 at 10:11 AM, Davidlohr Bueso <dave@stgolabs.net> wrote: > On Mon, 2015-02-02 at 11:13 -0800, Linus Torvalds wrote: >> IOW, instead of >> >> m = 1UL << (BITS_PER_LONG - 2); >> >> perhaps something like >> >> m = 1UL << (BITS_PER_LONG/2- 2); >> if (m < x) >> m <<= BITS_PER_LONG/2; On doing this change also extra instruction(shown below) will execute if x is small(say 10000). b = y + m; y >>= 1; if (x >= b) { Although i understand your point that we can set value of m according to range of x. But in future also range of x can change if apart from memory some other module starts using this function more extensively. So in worst case it will be almost same as current implementation. So i think its better to scale m according to x by doing while(m > x) m>>=2; rather than according to range of x. Thanks Anshul Garg >> >> (assuming gcc can change that code into a "cmov") might cut down the >> "lots of empty loops" case in half for small values of 'x'. > > Makes a lot of sense. > >> There's probably some other better cheap initial guess value estimator. > > Just to get a feeling for the normal arg range in the non-driver parts > that use this thing: > > fs/ceph/super.h: congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10); > fs/nfs/write.c: nfs_congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10); > fs/nfsd/nfscache.c: limit = (16 * int_sqrt(low_pages)) << (PAGE_SHIFT-10); > kernel/rcu/tree_plugin.h: ls = int_sqrt(nr_cpu_ids); > mm/memcontrol.c: inactive_ratio = int_sqrt(10 * gb); > mm/page_alloc.c: ratio = int_sqrt(10 * gb); > mm/page_alloc.c: new_min_free_kbytes = int_sqrt(lowmem_kbytes * 16); > > So mostly values that scale according to mem or cpu. > > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-02 19:13 ` Linus Torvalds 2015-02-03 4:41 ` Davidlohr Bueso @ 2017-07-20 11:24 ` Peter Zijlstra 2017-07-20 11:52 ` Joe Perches ` (2 more replies) 1 sibling, 3 replies; 27+ messages in thread From: Peter Zijlstra @ 2017-07-20 11:24 UTC (permalink / raw) To: Linus Torvalds Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List, anshul.g, Thomas Gleixner, joe On Mon, Feb 02, 2015 at 11:13:44AM -0800, Linus Torvalds wrote: > On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > (I'm also not entirely sure what uses int_sqrt() that ends up being so > > performance-critical, so it would be good to document that too, since > > that probably also matters for the "what's the normal argument range" > > question..) > > ... it's also not entirely clear that we need a whole new loop. We > might just instead start off with a better guess for 'm' using some > calculation that might be doable with a single conditional move > instruction instead of a loop. Because I suspect that the inevitable > branch misprediction of a new loop is likely as expensive as a few > iterations through the core one. > > IOW, instead of > > m = 1UL << (BITS_PER_LONG - 2); > > perhaps something like > > m = 1UL << (BITS_PER_LONG/2- 2); > if (m < x) > m <<= BITS_PER_LONG/2; > > (assuming gcc can change that code into a "cmov") might cut down the > "lots of empty loops" case in half for small values of 'x'. > > There's probably some other better cheap initial guess value estimator. So I had a wee look at int_sqrt(), and yes Davidlohr's patch makes it worse for my machine (Xeon E3v5 aka. fancy Skylake). As in _MUCH_ worse.. ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt Performance counter stats for './sqrt' (10 runs): 828,233,295 cycles:u ( +- 0.04% ) 2,480,095,771 instructions:u # 2.99 insn per cycle ( +- 0.00% ) 0.220202758 seconds time elapsed ( +- 0.18% ) ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt Performance counter stats for './sqrt' (10 runs): 603,809,551 cycles:u ( +- 1.30% ) 1,677,726,591 instructions:u # 2.78 insn per cycle ( +- 0.00% ) 0.160801044 seconds time elapsed That is, we went from ~60 cycles to ~82 cycles per invocation. Now, the patches proposed in this thread do indeed improve matters but seem to not have merged up until this day: ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DANSHUL=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt Performance counter stats for './sqrt' (10 runs): 598,827,196 cycles:u ( +- 0.18% ) 1,697,726,381 instructions:u # 2.84 insn per cycle ( +- 0.00% ) 0.159880738 seconds time elapsed ( +- 0.36% ) ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DLINUS=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt Performance counter stats for './sqrt' (10 runs): 601,463,184 cycles:u ( +- 0.07% ) 1,380,095,976 instructions:u # 2.29 insn per cycle ( +- 0.00% ) 0.160788095 seconds time elapsed ( +- 0.55% ) Which basically bring us back to the old performance. So I would strongly suggest we merge one. Now, the thing is, Intel CPUs are actually fairly good at fls(), so I gave Joe's suggestions a spin too.. ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt Performance counter stats for './sqrt' (10 runs): 390,420,004 cycles:u ( +- 0.07% ) 1,141,802,493 instructions:u # 2.92 insn per cycle ( +- 0.00% ) 0.105480000 seconds time elapsed ( +- 0.64% ) ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt Performance counter stats for './sqrt' (10 runs): 328,415,775 cycles:u ( +- 0.15% ) 1,138,579,704 instructions:u # 3.47 insn per cycle ( +- 0.00% ) 0.088703205 seconds time elapsed Which gets us a real improvement... Now even with the software fls() fallback from include/asm-generic/bitops/fls.h there is a significant improvement: ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 -DSOFTFLS=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt Performance counter stats for './sqrt' (10 runs): 375,633,931 cycles:u ( +- 0.13% ) 1,289,700,013 instructions:u # 3.43 insn per cycle ( +- 0.00% ) 0.100596211 seconds time elapsed Also, I ran with -DVALIDATE=1 and while the comment says its a 'rough' estimate of the sqrt() I did not in fact find any value for which we deviate < INT_MAX. ----------------------[ sqrt.c ]--------------------------- #include <math.h> #include <stdio.h> #define BITS_PER_LONG (sizeof(long) * 8) #ifndef LOOPS #define LOOPS 1000000 #endif #ifdef SOFTFLS static __always_inline unsigned long fls(unsigned long x) { int r = 32; if (!x) return 0; if (!(x & 0xffff0000u)) { x <<= 16; r -= 16; } if (!(x & 0xff000000u)) { x <<= 8; r -= 8; } if (!(x & 0xf0000000u)) { x <<= 4; r -= 4; } if (!(x & 0xc0000000u)) { x <<= 2; r -= 2; } if (!(x & 0x80000000u)) { x <<= 1; r -= 1; } return r; } #else static __always_inline unsigned long fls(unsigned long word) { asm("rep; bsr %1,%0" : "=r" (word) : "rm" (word)); return BITS_PER_LONG - 1 - word; } #endif #ifdef NEW unsigned long __attribute__((noinline)) int_sqrt(unsigned long x) { unsigned long b, m, y = 0; if (x <= 1) return x; #ifdef LINUS m = 1UL << (BITS_PER_LONG/2 - 2); if (m < x) m <<= BITS_PER_LONG/2; #else #ifdef FLS m = 1UL << ((fls(x) + 1) & ~1UL); #else m = 1UL << (BITS_PER_LONG - 2); #endif #ifdef ANSHUL while (m > x) m >>= 2; #endif #endif while (m != 0) { b = y + m; y >>= 1; if (x >= b) { x -= b; y += m; } m >>= 2; } return y; } #else unsigned long __attribute__((noinline)) int_sqrt(unsigned long x) { unsigned long op, res, one; op = x; res = 0; one = 1UL << (BITS_PER_LONG - 2); while (one > op) one >>= 2; while (one != 0) { if (op >= res + one) { op = op - (res + one); res = res + 2 * one; } res /= 2; one /= 4; } return res; } #endif void main(void) { unsigned long i; for (i=0; i<LOOPS; i++) { unsigned long a = int_sqrt(i); asm volatile("" : : "r" (a)); #ifdef VALIDATE { unsigned long b = floor(sqrt(i)); if (a != b) printf("%ld %ld %ld\n", i, a, b); } #endif } } -------------- The only caveat seems to be that our fls() is only defined for 'int' not long :/ --- lib/int_sqrt.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c index 1ef4cc344977..87c3aa360441 100644 --- a/lib/int_sqrt.c +++ b/lib/int_sqrt.c @@ -7,12 +7,11 @@ #include <linux/kernel.h> #include <linux/export.h> +#include <linux/bitops.h> /** - * int_sqrt - rough approximation to sqrt + * int_sqrt - approximation to sqrt * @x: integer of which to calculate the sqrt - * - * A very rough approximation to the sqrt() function. */ unsigned long int_sqrt(unsigned long x) { @@ -21,7 +20,11 @@ unsigned long int_sqrt(unsigned long x) if (x <= 1) return x; - m = 1UL << (BITS_PER_LONG - 2); + m = 1UL << ((fls(x) + 1) & ~1UL); + + while (m > x) + m >>= 2; + while (m != 0) { b = y + m; y >>= 1; ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2017-07-20 11:24 ` Peter Zijlstra @ 2017-07-20 11:52 ` Joe Perches 2017-07-20 14:13 ` Peter Zijlstra 2017-07-20 15:27 ` Peter Zijlstra 2017-07-20 18:31 ` Linus Torvalds 2 siblings, 1 reply; 27+ messages in thread From: Joe Perches @ 2017-07-20 11:52 UTC (permalink / raw) To: Peter Zijlstra, Linus Torvalds Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List, anshul.g, Thomas Gleixner On Thu, 2017-07-20 at 13:24 +0200, Peter Zijlstra wrote: [] > + m = 1UL << ((fls(x) + 1) & ~1UL); maybe #if BITS_PER_LONG == 64 m = 1UL << ((fls64(x) + 1) & ~1UL); #else m = 1UL << ((fls(x) + 1) & ~1UL); #endif ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2017-07-20 11:52 ` Joe Perches @ 2017-07-20 14:13 ` Peter Zijlstra 0 siblings, 0 replies; 27+ messages in thread From: Peter Zijlstra @ 2017-07-20 14:13 UTC (permalink / raw) To: Joe Perches Cc: Linus Torvalds, Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List, anshul.g, Thomas Gleixner On Thu, Jul 20, 2017 at 04:52:49AM -0700, Joe Perches wrote: > On Thu, 2017-07-20 at 13:24 +0200, Peter Zijlstra wrote: > [] > > + m = 1UL << ((fls(x) + 1) & ~1UL); > > maybe > > #if BITS_PER_LONG == 64 > m = 1UL << ((fls64(x) + 1) & ~1UL); > #else > m = 1UL << ((fls(x) + 1) & ~1UL); > #endif We do seem to have __fls() which is defined on long. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2017-07-20 11:24 ` Peter Zijlstra 2017-07-20 11:52 ` Joe Perches @ 2017-07-20 15:27 ` Peter Zijlstra 2017-07-20 18:31 ` Linus Torvalds 2 siblings, 0 replies; 27+ messages in thread From: Peter Zijlstra @ 2017-07-20 15:27 UTC (permalink / raw) To: Linus Torvalds Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List, anshul.g, Thomas Gleixner, joe On Thu, Jul 20, 2017 at 01:24:49PM +0200, Peter Zijlstra wrote: > ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt > > Performance counter stats for './sqrt' (10 runs): > > 328,415,775 cycles:u ( +- 0.15% ) > 1,138,579,704 instructions:u # 3.47 insn per cycle ( +- 0.00% ) > > 0.088703205 seconds time elapsed > static __always_inline unsigned long fls(unsigned long word) > { > asm("rep; bsr %1,%0" > : "=r" (word) > : "rm" (word)); > return BITS_PER_LONG - 1 - word; > } That is actually "lzcnt", if I used the regular fls implementation: static __always_inline unsigned long __fls(unsigned long word) { asm("bsr %1,%0" : "=r" (word) : "rm" (word)); return word; } It ends up slightly more expensive: ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt Performance counter stats for './sqrt' (10 runs): 384,842,215 cycles:u ( +- 0.08% ) 1,118,579,712 instructions:u # 2.91 insn per cycle ( +- 0.00% ) 0.103018001 seconds time elapsed Still loads cheaper than pretty much any other combination. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2017-07-20 11:24 ` Peter Zijlstra 2017-07-20 11:52 ` Joe Perches 2017-07-20 15:27 ` Peter Zijlstra @ 2017-07-20 18:31 ` Linus Torvalds 2017-07-20 22:34 ` Peter Zijlstra 2 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2017-07-20 18:31 UTC (permalink / raw) To: Peter Zijlstra Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List, anshul.g, Thomas Gleixner, Joe Perches How did this two-year old thread get resurrected? Anyway, it got resurrected without even answering one core question: On Thu, Jul 20, 2017 at 4:24 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Feb 02, 2015 at 11:13:44AM -0800, Linus Torvalds wrote: >>>> On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> > >> > (I'm also not entirely sure what uses int_sqrt() that ends up being so >> > performance-critical, so it would be good to document that too, since >> > that probably also matters for the "what's the normal argument range" >> > question..) This is still the case. Which of the (very few) users really _care_? And what are the normal values for that? For example, the 802.11 minstrel code does a "MINSTREL_TRUNC()" on a "unsigned int" value that is in millions. It's not even "unsigned long", so we know it's not many thousands of millions, and MINSTREL_TRUNC shifts it down by 12 bits. So we know we have at most a 20-bit argument. The one case that uses actual unsigned long seems to be "slow_is_prime_number()", and honestly, the sqrt() is the *least* of our problems there. There's a few drivers and filesystems that use it. I do not believe performance matters in those cases. Even if you do a "int_sqrt()" per nertwork packet on some high-performance network (and none of them look anything like that). And there's a couple of VM users. They don't look particularly critical either. So why do you care? Because honestly, calling int_sqrt() once in a blue moon with caches cold and no branch prediction information tends to have very different performance characteristics from calling it in a loop with very predictable input. So I think your "benchmark" is just garbage, in that it's testing something entirely different than the actual load. Also, since this is a generic library routine, no way can we depend on fls being fast. But we could certainly improve on the initial value a lot. It's just that we should probably strive to improve on it without adding extra branch misprediction or I$ misses - both things that your benchmark isn't actually testing at all, since it does the exact opposite of that by basically preloading both. And the *most* important question is that first one: "Why does this matter, and what is the range it matters for?" Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2017-07-20 18:31 ` Linus Torvalds @ 2017-07-20 22:34 ` Peter Zijlstra 2017-07-20 23:24 ` Linus Torvalds 0 siblings, 1 reply; 27+ messages in thread From: Peter Zijlstra @ 2017-07-20 22:34 UTC (permalink / raw) To: Linus Torvalds Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List, anshul.g, Thomas Gleixner, Joe Perches On Thu, Jul 20, 2017 at 11:31:36AM -0700, Linus Torvalds wrote: > How did this two-year old thread get resurrected? I was looking for the original thread doing that 'optimization' Davidlohr did but found this first. > And the *most* important question is that first one: > > "Why does this matter, and what is the range it matters for?" I was looking to do some work on the idle estimator. Parts of that keep online avg and variance for normal distributions. I wanted to bias the avg downwards, the way to do that is to subtract a scaled stdev from it. Computing the stdev from a variance requires the sqrt. Thomas rightly asked how expensive our sqrt is, I found Davidlohr's commit message and got confused by the numbers, so I reran them and found the optimization did the reverse, it made things worse. By then I was prodding at it... 'fun' problem :-) In any case, I suppose the range of values would be in the order of TICK_NSEC, so the variance would be a number of orders below that. So we're looking at fairly small numbers <1e5. > Also, since this is a generic library routine, no way can we depend on > fls being fast. Which is why I also tested the software fls, but you're right in that the microbench primes the branch predictor. Still, the software fls is 6 branches, whereas the 'missing' loop: while (m > x) m >>= 2; would need up to 30 or so cycles worst case. So even in that respect it makes sense its a 'win', esp. so for small numbers. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2017-07-20 22:34 ` Peter Zijlstra @ 2017-07-20 23:24 ` Linus Torvalds 2017-07-21 11:40 ` Peter Zijlstra 0 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2017-07-20 23:24 UTC (permalink / raw) To: Peter Zijlstra Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List, anshul.g, Thomas Gleixner, Joe Perches [-- Attachment #1: Type: text/plain, Size: 1777 bytes --] On Thu, Jul 20, 2017 at 3:34 PM, Peter Zijlstra <peterz@infradead.org> wrote: >> >> "Why does this matter, and what is the range it matters for?" > > I was looking to do some work on the idle estimator. Parts of that keep > online avg and variance for normal distributions. I wanted to bias the > avg downwards, the way to do that is to subtract a scaled stdev from it. > Computing the stdev from a variance requires the sqrt. > > In any case, I suppose the range of values would be in the order of > TICK_NSEC, so the variance would be a number of orders below that. So > we're looking at fairly small numbers <1e5. Ok. So that already cuts down the problem space a lot. And yes, the current code is very suboptimal for exactly the small number case. It's also typiocally the case that right-shifting is more expensive than left-shifting, so the fact that we left-shift twice in that loop for all the big values is extra expensive. So I would actually suggest a different approach: - start with a small 'm' - and grow it up. The "small m" means that there is a small constant, which is good. And growing it up is a single trivial left shift by two, which can also be done just two adds or as a single lea, so it works fine even on architectures with slow shifters etc. So mayube something like the attached? NOTE NOTE NOTE! This may be pure garbage. It's untested. But the code generation looks ok even if gcc is being way too clever and turns that first loop into a counted loop instead. Whatever, maybe it's the right thing to do. But note that I might have broken the sqrt for some case, so you need to double-check my logic there. The *intention* is that it's the exact same thing as it used to do, just initializing 'm' differently. Linus [-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 745 bytes --] lib/int_sqrt.c | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c index 1ef4cc344977..da3b3dabad8e 100644 --- a/lib/int_sqrt.c +++ b/lib/int_sqrt.c @@ -16,14 +16,22 @@ */ unsigned long int_sqrt(unsigned long x) { - unsigned long b, m, y = 0; + unsigned long m, y; if (x <= 1) return x; - m = 1UL << (BITS_PER_LONG - 2); - while (m != 0) { - b = y + m; + m = 64; + do { + unsigned long new_m = m << 2; + if (!new_m) + break; + m = new_m; + } while (m < x); + + y = 0; + do { + unsigned long b = y + m; y >>= 1; if (x >= b) { @@ -31,7 +39,7 @@ unsigned long int_sqrt(unsigned long x) y += m; } m >>= 2; - } + } while (m); return y; } ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2017-07-20 23:24 ` Linus Torvalds @ 2017-07-21 11:40 ` Peter Zijlstra 2017-07-21 12:15 ` Joe Perches 0 siblings, 1 reply; 27+ messages in thread From: Peter Zijlstra @ 2017-07-21 11:40 UTC (permalink / raw) To: Linus Torvalds Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List, anshul.g, Thomas Gleixner, Joe Perches, Ingo Molnar, Will Deacon [-- Attachment #1: Type: text/plain, Size: 4744 bytes --] On Thu, Jul 20, 2017 at 04:24:32PM -0700, Linus Torvalds wrote: > So mayube something like the attached? > > NOTE NOTE NOTE! This may be pure garbage. It's untested. But the code > generation looks ok even if gcc is being way too clever and turns that > first loop into a counted loop instead. Whatever, maybe it's the right > thing to do. It passes a -DVALIDATE=1 run (up to a billion or so), so it seems to work. I've made the test thing a bit more complicated in order to address your concerns for primed branch predictors (and once the branch predictors get wiped, the predictable input values don't matter anymore either I think, although I could easily LFSR generate them as well of course). Find it attached in a tarball. The results, from running ./test.sh (left SKL, right SNB): (values <100k) Cycles: EVENT=0 -DLINUS=1 event: 29.533080 +- 0.046261 36.800100 +- 0.046067 EVENT=0 -DLINUS=1 -DWIPE_BTB=1 event: 143.513440 +- 0.152651 153.054640 +- 0.099403 EVENT=0 -DNEW=1 -DANSHUL=1 event: 41.457300 +- 0.048409 55.046100 +- 0.044968 EVENT=0 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1 event: 128.351400 +- 0.366873 161.183690 +- 0.132301 EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 event: 21.433170 +- 0.043859 27.672700 +- 0.063982 EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1 event: 79.850210 +- 0.133646 112.422470 +- 0.101465 EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 event: 31.771680 +- 0.037655 37.515160 +- 0.080305 EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1 event: 129.673440 +- 0.430308 125.514390 +- 0.117590 Branches: EVENT=4 -DLINUS=1 event: 31.507740 +- 0.010470 31.507710 +- 0.010470 EVENT=4 -DLINUS=1 -DWIPE_BTB=1 event: 31.507720 +- 0.010470 31.507760 +- 0.010470 EVENT=4 -DNEW=1 -DANSHUL=1 event: 45.125510 +- 0.002696 45.125510 +- 0.002696 EVENT=4 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1 event: 45.125490 +- 0.002696 45.125540 +- 0.002697 EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 event: 21.252320 +- 0.005266 21.252330 +- 0.005266 EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1 event: 21.252320 +- 0.005266 21.252340 +- 0.005266 EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 event: 25.727940 +- 0.005975 25.727930 +- 0.005975 EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1 event: 25.727940 +- 0.005975 25.727930 +- 0.005975 Branch-Misses: EVENT=5 -DLINUS=1 event: 0.022670 +- 0.000789 0.020920 +- 0.000812 EVENT=5 -DLINUS=1 -DWIPE_BTB=1 event: 5.481750 +- 0.006930 5.955130 +- 0.005228 EVENT=5 -DNEW=1 -DANSHUL=1 event: 0.025990 +- 0.000798 0.017170 +- 0.000690 EVENT=5 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1 event: 3.343600 +- 0.004249 5.723570 +- 0.004876 EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 event: 0.019040 +- 0.000731 0.022570 +- 0.000829 EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1 event: 2.156350 +- 0.004643 4.297650 +- 0.004910 EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 event: 0.019800 +- 0.000747 0.016780 +- 0.000688 EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1 event: 4.481360 +- 0.005505 4.518300 +- 0.004877 Which still doesn't explain your hatred of fls(), as even with cold branches and a software fls, its faster than your latest proposal (as can be explained by the lower total number of branches for the software fls variant compared to your proposal). And when we start to look at larger values, the fls() one pulls even further ahead. So I'll stick with my proposal... I can write up a proper Changelog and maybe add my test to tools/testing/sqrt/ if you think its worth it. --- lib/int_sqrt.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c index 1ef4cc344977..611933760225 100644 --- a/lib/int_sqrt.c +++ b/lib/int_sqrt.c @@ -7,12 +7,13 @@ #include <linux/kernel.h> #include <linux/export.h> +#include <linux/bitops.h> /** - * int_sqrt - rough approximation to sqrt + * int_sqrt - Computes the integer square root * @x: integer of which to calculate the sqrt * - * A very rough approximation to the sqrt() function. + * Computes: floor(sqrt(@x)) */ unsigned long int_sqrt(unsigned long x) { @@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x) if (x <= 1) return x; - m = 1UL << (BITS_PER_LONG - 2); + m = 1UL << (__fls(x) & ~1U); + + while (m > x) + m >>= 2; + while (m != 0) { b = y + m; y >>= 1; [-- Attachment #2: sqrt.tar --] [-- Type: application/x-tar, Size: 20480 bytes --] ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2017-07-21 11:40 ` Peter Zijlstra @ 2017-07-21 12:15 ` Joe Perches 2017-07-21 13:26 ` Peter Zijlstra 0 siblings, 1 reply; 27+ messages in thread From: Joe Perches @ 2017-07-21 12:15 UTC (permalink / raw) To: Peter Zijlstra, Linus Torvalds Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List, anshul.g, Thomas Gleixner, Ingo Molnar, Will Deacon On Fri, 2017-07-21 at 13:40 +0200, Peter Zijlstra wrote: > @@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x) > if (x <= 1) > return x; > > - m = 1UL << (BITS_PER_LONG - 2); > + m = 1UL << (__fls(x) & ~1U); > + > + while (m > x) > + m >>= 2; while (m > x) ? Belt and suspenders if __fls is broken? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2017-07-21 12:15 ` Joe Perches @ 2017-07-21 13:26 ` Peter Zijlstra 2017-07-21 13:33 ` Peter Zijlstra 0 siblings, 1 reply; 27+ messages in thread From: Peter Zijlstra @ 2017-07-21 13:26 UTC (permalink / raw) To: Joe Perches Cc: Linus Torvalds, Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List, anshul.g, Thomas Gleixner, Ingo Molnar, Will Deacon On Fri, Jul 21, 2017 at 05:15:10AM -0700, Joe Perches wrote: > On Fri, 2017-07-21 at 13:40 +0200, Peter Zijlstra wrote: > > @@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x) > > if (x <= 1) > > return x; > > > > - m = 1UL << (BITS_PER_LONG - 2); > > + m = 1UL << (__fls(x) & ~1U); > > + > > + while (m > x) > > + m >>= 2; > > while (m > x) ? > > Belt and suspenders if __fls is broken? Hmm... you're right, that should not happen. It is a remnant from when I rounded up, like: m = 1UL << ((__fls(x) + 1) & ~1UL); Because I worried about the case where m == x, which is not included in the loop above (but works when you look at the actual computation loop and passes VALIDATE=1). But check this... I cannot explain :/ When I remove that loop, we, as fully expected, loose 1 branch, but the cycle count for the branch-cold case shoots up. Must be something GCC does. EVENT=0 -DNEW=1 -DFLS=1 event: 19.626050 +- 0.038995 EVENT=0 -DNEW=1 -DFLS=1 -DWIPE_BTB=1 event: 109.610670 +- 0.425667 EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1 event: 21.445680 +- 0.043782 EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1 event: 83.590420 +- 0.142126 EVENT=4 -DNEW=1 -DFLS=1 event: 20.252330 +- 0.005265 EVENT=4 -DNEW=1 -DFLS=1 -DWIPE_BTB=1 event: 20.252340 +- 0.005265 EVENT=4 -DNEW=1 -DFLS=1 -DANSHUL=1 event: 21.252300 +- 0.005266 EVENT=4 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1 event: 21.252300 +- 0.005266 EVENT=5 -DNEW=1 -DFLS=1 event: 0.019370 +- 0.000732 EVENT=5 -DNEW=1 -DFLS=1 -DWIPE_BTB=1 event: 3.665240 +- 0.005309 EVENT=5 -DNEW=1 -DFLS=1 -DANSHUL=1 event: 0.020150 +- 0.000755 EVENT=5 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1 event: 2.225330 +- 0.004875 Let me dig out another GCC version current: gcc (Debian 6.3.0-18) 6.3.0 20170516 ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2017-07-21 13:26 ` Peter Zijlstra @ 2017-07-21 13:33 ` Peter Zijlstra 0 siblings, 0 replies; 27+ messages in thread From: Peter Zijlstra @ 2017-07-21 13:33 UTC (permalink / raw) To: Joe Perches Cc: Linus Torvalds, Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List, anshul.g, Thomas Gleixner, Ingo Molnar, Will Deacon On Fri, Jul 21, 2017 at 03:26:21PM +0200, Peter Zijlstra wrote: > > EVENT=0 -DNEW=1 -DFLS=1 > event: 19.626050 +- 0.038995 > EVENT=0 -DNEW=1 -DFLS=1 -DWIPE_BTB=1 > event: 109.610670 +- 0.425667 > > EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1 > event: 21.445680 +- 0.043782 > EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1 > event: 83.590420 +- 0.142126 > > Let me dig out another GCC version current: > > gcc (Debian 6.3.0-18) 6.3.0 20170516 gcc-7 (Debian 7.1.0-9) 7.1.0 EVENT=0 -DNEW=1 -DFLS=1 event: 24.179400 +- 0.031344 EVENT=0 -DNEW=1 -DFLS=1 -DWIPE_BTB=1 event: 137.892390 +- 0.307314 EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1 event: 22.740300 +- 0.051317 EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1 event: 136.980640 +- 0.223410 GCC regressed it seems... *sigh* ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-02 19:00 ` Linus Torvalds 2015-02-02 19:13 ` Linus Torvalds @ 2015-02-03 4:47 ` Davidlohr Bueso 2015-02-03 15:42 ` Anshul Garg 1 sibling, 1 reply; 27+ messages in thread From: Davidlohr Bueso @ 2015-02-03 4:47 UTC (permalink / raw) To: Linus Torvalds; +Cc: Anshul Garg, Linux Kernel Mailing List, anshul.g On Mon, 2015-02-02 at 11:00 -0800, Linus Torvalds wrote: > Hmm. I don't disagree, but would like some more feedback. > > Davidlohr - you were the person to touch this function last (commit > 30493cc9dddb: "lib/int_sqrt.c: optimize square root algorithm"), and > you did so for performance reasons. And in fact, when you did that, > you removed that initial loop: > > - one = 1UL << (BITS_PER_LONG - 2); > - while (one > op) > - one >>= 2; > > but I'm not sure that was actually all that conscious, I think the > real optimization was the changes inside the loop to make the final > real loop faster and simpler. I missed that. And yes, the real optimization should be in the loop. > > Also, you had performance numbers, so presumably a test harness for it > all. It probably depends a lot on the actual distribution of argument > values, of course, but it would be good to accompany the patch with > actual real numbers like lasty time. Aha. In my case I recall I ran a usersapce program using each function from 1 to a million, and throwing perf at it for 10 times. > (I'm also not entirely sure what uses int_sqrt() that ends up being so > performance-critical, so it would be good to document that too, since > that probably also matters for the "what's the normal argument range" > question..) It's not a big deal afaik. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-03 4:47 ` Davidlohr Bueso @ 2015-02-03 15:42 ` Anshul Garg 2015-02-05 18:20 ` Linus Torvalds 0 siblings, 1 reply; 27+ messages in thread From: Anshul Garg @ 2015-02-03 15:42 UTC (permalink / raw) To: Davidlohr Bueso; +Cc: Linus Torvalds, Linux Kernel Mailing List, anshul.g On Tue, Feb 3, 2015 at 10:17 AM, Davidlohr Bueso <dave@stgolabs.net> wrote: > On Mon, 2015-02-02 at 11:00 -0800, Linus Torvalds wrote: >> Hmm. I don't disagree, but would like some more feedback. >> >> Davidlohr - you were the person to touch this function last (commit >> 30493cc9dddb: "lib/int_sqrt.c: optimize square root algorithm"), and >> you did so for performance reasons. And in fact, when you did that, >> you removed that initial loop: >> >> - one = 1UL << (BITS_PER_LONG - 2); >> - while (one > op) >> - one >>= 2; >> >> but I'm not sure that was actually all that conscious, I think the >> real optimization was the changes inside the loop to make the final >> real loop faster and simpler. > > I missed that. And yes, the real optimization should be in the loop. > >> >> Also, you had performance numbers, so presumably a test harness for it >> all. It probably depends a lot on the actual distribution of argument >> values, of course, but it would be good to accompany the patch with >> actual real numbers like lasty time. > > Aha. In my case I recall I ran a usersapce program using each function > from 1 to a million, and throwing perf at it for 10 times. > I have done profiling of int_sqrt function using perf tool for 10 times. For this purpose i have created a userspace program which uses sqrt function from 1 to a million. int_sqrt_old -> current algorithm version int_sqrt_new -> with proposed change these results are for BITS_PER_LONG=64. Performance counter stats for './int_sqrt_old' (10 runs): 460.944061 task-clock (msec) # 0.969 CPUs utilized ( +- 1.72% ) 64 context-switches # 0.139 K/sec ( +- 2.27% ) 0 cpu-migrations # 0.000 K/sec 132 page-faults # 0.286 K/sec <not supported> cycles <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend <not supported> instructions <not supported> branches <not supported> branch-misses 0.475795341 seconds time elapsed( +- 3.20% ) Performance counter stats for './int_sqrt_new' (10 runs): 401.782119 task-clock (msec) # 0.974 CPUs utilized( +- 1.55% ) 57 context-switches # 0.141 K/sec( +- 1.92% ) 0 cpu-migrations # 0.000 K/sec 132 page-faults # 0.329 K/sec <not supported> cycles <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend <not supported> instructions <not supported> branches <not supported> branch-misses 0.412593296 seconds time elapsed( +- 2.03% ) As per profiling definitely there is improvement in algorithm timing. >> (I'm also not entirely sure what uses int_sqrt() that ends up being so >> performance-critical, so it would be good to document that too, since >> that probably also matters for the "what's the normal argument range" >> question..) > > It's not a big deal afaik. > > Thanks, > Davidlohr > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-03 15:42 ` Anshul Garg @ 2015-02-05 18:20 ` Linus Torvalds 2015-02-05 18:33 ` Linus Torvalds 0 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2015-02-05 18:20 UTC (permalink / raw) To: Anshul Garg; +Cc: Davidlohr Bueso, Linux Kernel Mailing List, anshul.g [-- Attachment #1: Type: text/plain, Size: 1328 bytes --] On Tue, Feb 3, 2015 at 7:42 AM, Anshul Garg <aksgarg1989@gmail.com> wrote: > > I have done profiling of int_sqrt function using perf tool for 10 times. > For this purpose i have created a userspace program which uses sqrt function > from 1 to a million. Hmm. I did that too, and it doesn't improve things for me. In fact, it makes it slower. [torvalds@i7 ~]$ gcc -Wall -O2 -DREDUCE int_sqrt.c ; time ./a.out real 0m2.098s user 0m2.095s sys 0m0.000s [torvalds@i7 ~]$ gcc -Wall -O2 int_sqrt.c ; time ./a.out real 0m1.886s user 0m1.883s sys 0m0.000s and the profile shows that 35% of the time is spent on that branch back of the initial reduction loop. In contrast, my suggested "reduce just once" does seem to improve things: [torvalds@i7 ~]$ gcc -Wall -O2 -DONCE int_sqrt.c ; time ./a.out real 0m1.436s user 0m1.434s sys 0m0.000s but it's kind of hacky. NOTE! This probably depends a lot on microarchitecture details, including very much branch predictor etc. And I didn't actually check that it gives the right result, but I do think that this optimization needs to be looked at more if we want to do it. I was running this on an i7-4770S, fwiw. Attached is the stupid test-program I used to do the above. Maybe I did something wrong. Linus [-- Attachment #2: int_sqrt.c --] [-- Type: text/x-csrc, Size: 951 bytes --] /* * Copyright (C) 2013 Davidlohr Bueso <davidlohr.bueso@hp.com> * * Based on the shift-and-subtract algorithm for computing integer * square root from Guy L. Steele. */ #define BITS_PER_LONG (8*sizeof(long)) /** * int_sqrt - rough approximation to sqrt * @x: integer of which to calculate the sqrt * * A very rough approximation to the sqrt() function. */ unsigned long __attribute__((noinline)) int_sqrt(unsigned long x) { unsigned long b, m, y = 0; if (x <= 1) return x; m = 1UL << (BITS_PER_LONG - 2); #ifdef REDUCE while (m > x) m >>= 2; #elif defined(ONCE) { unsigned long n = m >> (BITS_PER_LONG/2); m = (n >= x) ? n : m; } #endif while (m != 0) { b = y + m; y >>= 1; if (x >= b) { x -= b; y += m; } m >>= 2; } return y; } int main(int argc, char **argv) { unsigned long i; for (i = 0; i < 100000000; i++) { unsigned long a = int_sqrt(i); asm volatile("": : "r" (a)); } return 0; } ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-05 18:20 ` Linus Torvalds @ 2015-02-05 18:33 ` Linus Torvalds 2015-02-05 18:43 ` Anshul Garg 0 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2015-02-05 18:33 UTC (permalink / raw) To: Anshul Garg; +Cc: Davidlohr Bueso, Linux Kernel Mailing List, anshul.g On Thu, Feb 5, 2015 at 10:20 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Hmm. I did that too [..] Side note: one difference in our results (apart from possibly just CPU uarch details) is that my loop goes to 100M to make it easier to just time it. Which means that my load essentially had about three more iterations over nonzero data. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-05 18:33 ` Linus Torvalds @ 2015-02-05 18:43 ` Anshul Garg 2015-02-05 19:37 ` Linus Torvalds 0 siblings, 1 reply; 27+ messages in thread From: Anshul Garg @ 2015-02-05 18:43 UTC (permalink / raw) To: Linus Torvalds; +Cc: Davidlohr Bueso, Linux Kernel Mailing List, anshul.g [-- Attachment #1: Type: text/plain, Size: 1134 bytes --] On Thu, Feb 5, 2015 at 10:33 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Feb 5, 2015 at 10:20 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Hmm. I did that too [..] > > Side note: one difference in our results (apart from possibly just CPU > uarch details) is that my loop goes to 100M to make it easier to just > time it. Which means that my load essentially had about three more > iterations over nonzero data. > > Linus I have also done the same testing on 100 million numbers. Attaching source codes. Below is the result :: int_sqrt_old - current int_sqrt_new - With proposed change anshul@ubuntu:~/kernel_latest/sqrt$ time ./int_sqrt_old real 0m41.895s user 0m36.490s sys 0m0.365s anshul@ubuntu:~/kernel_latest/sqrt$ time ./int_sqrt_new real 0m39.491s user 0m36.495s sys 0m0.338s I have run this test on Intel(R) Core(TM) i3-4000M CPU @ 2.40GHz VMWare Machine. Please check if i am doing anything wrong. NOTE :: I have not used gcc optimizations while compilation. With O2 level optimization proposed solution is taking more time. [-- Attachment #2: int_sqrt_old.c --] [-- Type: text/x-csrc, Size: 721 bytes --] /* * Copyright (C) 2013 Davidlohr Bueso <davidlohr.bueso@hp.com> * * Based on the shift-and-subtract algorithm for computing integer * square root from Guy L. Steele. */ #include <stdio.h> /** * int_sqrt - rough approximation to sqrt * @x: integer of which to calculate the sqrt * * A very rough approximation to the sqrt() function. */ #define BITS_PER_LONG (8*sizeof(long)) unsigned long int_sqrt(unsigned long x) { unsigned long b, m, y = 0; if (x <= 1) return x; m = 1UL << (BITS_PER_LONG - 2); while (m != 0) { b = y + m; y >>= 1; if (x >= b) { x -= b; y += m; } m >>= 2; } return y; } int main() { unsigned long n = 1; for(;n<=100000000;++n) int_sqrt(n); return 0; } [-- Attachment #3: int_sqrt_new.c --] [-- Type: text/x-csrc, Size: 750 bytes --] /* * Copyright (C) 2013 Davidlohr Bueso <davidlohr.bueso@hp.com> * * Based on the shift-and-subtract algorithm for computing integer * square root from Guy L. Steele. */ #include <stdio.h> #define BITS_PER_LONG (8*sizeof(long)) /** * int_sqrt - rough approximation to sqrt * @x: integer of which to calculate the sqrt * * A very rough approximation to the sqrt() function. */ unsigned long int_sqrt(unsigned long x) { unsigned long b, m, y = 0; if (x <= 1) return x; m = 1UL << (BITS_PER_LONG - 2); while(m > x ) m >>= 2; while (m != 0) { b = y + m; y >>= 1; if (x >= b) { x -= b; y += m; } m >>= 2; } return y; } int main() { unsigned long n = 1; for(;n<=100000000;++n) int_sqrt(n); return 0; } ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-05 18:43 ` Anshul Garg @ 2015-02-05 19:37 ` Linus Torvalds 2015-02-08 15:39 ` Anshul Garg 0 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2015-02-05 19:37 UTC (permalink / raw) To: Anshul Garg; +Cc: Davidlohr Bueso, Linux Kernel Mailing List, anshul.g On Thu, Feb 5, 2015 at 10:43 AM, Anshul Garg <aksgarg1989@gmail.com> wrote: > > NOTE :: > I have not used gcc optimizations while compilation. > With O2 level optimization proposed solution is taking more time. The thing is, the kernel is compiled with -O2, so that's what matters. Also, for very tight loops like this, the major costs tend to be very subtle microarchitectural details, particularly branch prediction. Which in turn end up sometimes depending on just exactly where the branches were placed, and even whether two conditional branches were in the same 8-byte aligned region etc things (because the branch prediction might be done ignoring the low bits of the EIP etc). So not only does the exact microarchitecture matter, things that don't *seem* like they should matter can change behavior a lot. My point is really that the performance numbers are very ambiguous. The patch may well help in some situations, but hurt in others. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-05 19:37 ` Linus Torvalds @ 2015-02-08 15:39 ` Anshul Garg 0 siblings, 0 replies; 27+ messages in thread From: Anshul Garg @ 2015-02-08 15:39 UTC (permalink / raw) To: Linus Torvalds; +Cc: Davidlohr Bueso, Linux Kernel Mailing List, anshul.g Dear Mr. linus, Thanks for quick replies. Yes performance numbers are not conclusive enough. So its better to discard this patch as of now. I will try to explore more in this area. Thanks & regards Anshul Garg On Fri, Feb 6, 2015 at 1:07 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Feb 5, 2015 at 10:43 AM, Anshul Garg <aksgarg1989@gmail.com> wrote: >> >> NOTE :: >> I have not used gcc optimizations while compilation. >> With O2 level optimization proposed solution is taking more time. > > The thing is, the kernel is compiled with -O2, so that's what matters. > > Also, for very tight loops like this, the major costs tend to be very > subtle microarchitectural details, particularly branch prediction. > Which in turn end up sometimes depending on just exactly where the > branches were placed, and even whether two conditional branches were > in the same 8-byte aligned region etc things (because the branch > prediction might be done ignoring the low bits of the EIP etc). So not > only does the exact microarchitecture matter, things that don't *seem* > like they should matter can change behavior a lot. > > My point is really that the performance numbers are very ambiguous. > The patch may well help in some situations, but hurt in others. > > Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-02 17:12 [PATCH] lib/int_sqrt.c: Optimize square root function Anshul Garg 2015-02-02 19:00 ` Linus Torvalds @ 2015-02-02 19:10 ` Joe Perches 2015-02-02 19:39 ` Linus Torvalds 1 sibling, 1 reply; 27+ messages in thread From: Joe Perches @ 2015-02-02 19:10 UTC (permalink / raw) To: Anshul Garg; +Cc: linux-kernel, anshul.g, torvalds On Mon, 2015-02-02 at 09:12 -0800, Anshul Garg wrote: > From: Anshul Garg <aksgarg1989@gmail.com> > > Unnecessary instructions are executing even though m is > greater than x so added logic to make m less than equal to > x before performing these operations. > > Signed-off-by: Anshul Garg <aksgarg1989@gmail.com> > --- > lib/int_sqrt.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c [] > @@ -22,6 +22,9 @@ unsigned long int_sqrt(unsigned long x) > return x; > > m = 1UL << (BITS_PER_LONG - 2); > + > + while (m > x) > + m >>= 2; Perhaps removing the while and using fls(x) would be better. Perhaps something like: m = 1UL << min(sizeof(unsigned long) == sizeof(u64) ? fls64(x) : fls(x), BITS_PER_LONG - 2); > while (m != 0) { > b = y + m; > y >>= 1; ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] lib/int_sqrt.c: Optimize square root function 2015-02-02 19:10 ` Joe Perches @ 2015-02-02 19:39 ` Linus Torvalds 0 siblings, 0 replies; 27+ messages in thread From: Linus Torvalds @ 2015-02-02 19:39 UTC (permalink / raw) To: Joe Perches; +Cc: Anshul Garg, Linux Kernel Mailing List, anshul.g On Mon, Feb 2, 2015 at 11:10 AM, Joe Perches <joe@perches.com> wrote: > > Perhaps removing the while and using fls(x) would be better. No. fls is often slow, and you then do have to also round it down to an even bit number etc, so it gets somewhat complex too. We *probably* have some argument range that we care more about, which is why I'd like to know what the profile is that triggered this optimization, and what the common argument range is. For example, one user is the cpu frequency governor, which seems to try to predict the length of the next sleep. The values can probably be anything (it has protections for overflowing into "long long"), but presumably the onyl time we care about performance is when it's called very very often, in which case I'm going out on a limb and saying that the intervals will be short, and the standard deviation of those intervals will also be small. So *maybe* that function really only cares about performance when we have small arguments. I dunno. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH] lib/int_sqrt.c: Optimize square root function @ 2015-01-29 13:35 Anshul Garg 0 siblings, 0 replies; 27+ messages in thread From: Anshul Garg @ 2015-01-29 13:35 UTC (permalink / raw) To: linux-kernel; +Cc: aksgarg1989, anshul.g From: Anshul Garg <aksgarg1989@gmail.com> Unnecessary instructions are executing even though m is greater than x so added logic to make m less than equal to x before performing these operations. Signed-off-by: Anshul Garg <aksgarg1989@gmail.com> --- lib/int_sqrt.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c index 1ef4cc3..64ae722 100644 --- a/lib/int_sqrt.c +++ b/lib/int_sqrt.c @@ -22,6 +22,9 @@ unsigned long int_sqrt(unsigned long x) return x; m = 1UL << (BITS_PER_LONG - 2); + + while (m > x) + m >>= 2; while (m != 0) { b = y + m; y >>= 1; -- 1.7.9.5 --- This email has been checked for viruses by Avast antivirus software. http://www.avast.com ^ permalink raw reply related [flat|nested] 27+ messages in thread
end of thread, other threads:[~2017-07-21 13:33 UTC | newest] Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-02-02 17:12 [PATCH] lib/int_sqrt.c: Optimize square root function Anshul Garg 2015-02-02 19:00 ` Linus Torvalds 2015-02-02 19:13 ` Linus Torvalds 2015-02-03 4:41 ` Davidlohr Bueso 2015-02-03 4:57 ` Davidlohr Bueso 2015-02-03 15:54 ` Anshul Garg 2017-07-20 11:24 ` Peter Zijlstra 2017-07-20 11:52 ` Joe Perches 2017-07-20 14:13 ` Peter Zijlstra 2017-07-20 15:27 ` Peter Zijlstra 2017-07-20 18:31 ` Linus Torvalds 2017-07-20 22:34 ` Peter Zijlstra 2017-07-20 23:24 ` Linus Torvalds 2017-07-21 11:40 ` Peter Zijlstra 2017-07-21 12:15 ` Joe Perches 2017-07-21 13:26 ` Peter Zijlstra 2017-07-21 13:33 ` Peter Zijlstra 2015-02-03 4:47 ` Davidlohr Bueso 2015-02-03 15:42 ` Anshul Garg 2015-02-05 18:20 ` Linus Torvalds 2015-02-05 18:33 ` Linus Torvalds 2015-02-05 18:43 ` Anshul Garg 2015-02-05 19:37 ` Linus Torvalds 2015-02-08 15:39 ` Anshul Garg 2015-02-02 19:10 ` Joe Perches 2015-02-02 19:39 ` Linus Torvalds -- strict thread matches above, loose matches on Subject: below -- 2015-01-29 13:35 Anshul Garg
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.