All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] lib/int_sqrt.c: Optimize square root function
@ 2015-01-29 13:35 Anshul Garg
  0 siblings, 0 replies; 27+ messages in thread
From: Anshul Garg @ 2015-01-29 13:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: aksgarg1989, anshul.g

From: Anshul Garg <aksgarg1989@gmail.com>

Unnecessary instructions are executing even though m is
greater than x so added logic to make m less than equal to
x before performing these operations.

Signed-off-by: Anshul Garg <aksgarg1989@gmail.com>
---
 lib/int_sqrt.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index 1ef4cc3..64ae722 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -22,6 +22,9 @@ unsigned long int_sqrt(unsigned long x)
 		return x;
 
 	m = 1UL << (BITS_PER_LONG - 2);
+
+	while (m > x)
+		m >>= 2;
 	while (m != 0) {
 		b = y + m;
 		y >>= 1;
-- 
1.7.9.5


---
This email has been checked for viruses by Avast antivirus software.
http://www.avast.com


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2017-07-21 13:26                 ` Peter Zijlstra
@ 2017-07-21 13:33                   ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2017-07-21 13:33 UTC (permalink / raw)
  To: Joe Perches
  Cc: Linus Torvalds, Anshul Garg, Davidlohr Bueso,
	Linux Kernel Mailing List, anshul.g, Thomas Gleixner,
	Ingo Molnar, Will Deacon

On Fri, Jul 21, 2017 at 03:26:21PM +0200, Peter Zijlstra wrote:
> 
> EVENT=0 -DNEW=1 -DFLS=1
> event: 19.626050 +- 0.038995
> EVENT=0 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
> event: 109.610670 +- 0.425667
> 
> EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1
> event: 21.445680 +- 0.043782
> EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1
> event: 83.590420 +- 0.142126
> 

> Let me dig out another GCC version current:
> 
>   gcc (Debian 6.3.0-18) 6.3.0 20170516

gcc-7 (Debian 7.1.0-9) 7.1.0

EVENT=0 -DNEW=1 -DFLS=1
event: 24.179400 +- 0.031344
EVENT=0 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
event: 137.892390 +- 0.307314

EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1
event: 22.740300 +- 0.051317
EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1
event: 136.980640 +- 0.223410


GCC regressed it seems... *sigh*

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2017-07-21 12:15               ` Joe Perches
@ 2017-07-21 13:26                 ` Peter Zijlstra
  2017-07-21 13:33                   ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2017-07-21 13:26 UTC (permalink / raw)
  To: Joe Perches
  Cc: Linus Torvalds, Anshul Garg, Davidlohr Bueso,
	Linux Kernel Mailing List, anshul.g, Thomas Gleixner,
	Ingo Molnar, Will Deacon

On Fri, Jul 21, 2017 at 05:15:10AM -0700, Joe Perches wrote:
> On Fri, 2017-07-21 at 13:40 +0200, Peter Zijlstra wrote:
> > @@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x)
> >  	if (x <= 1)
> >  		return x;
> >  
> > -	m = 1UL << (BITS_PER_LONG - 2);
> > +	m = 1UL << (__fls(x) & ~1U);
> > +
> > +	while (m > x)
> > +		m >>= 2;
> 
> while (m > x) ?
> 
> Belt and suspenders if __fls is broken?

Hmm... you're right, that should not happen. It is a remnant from when I
rounded up, like:

	m = 1UL << ((__fls(x) + 1) & ~1UL);

Because I worried about the case where m == x, which is not included in
the loop above (but works when you look at the actual computation loop
and passes VALIDATE=1).


But check this... I cannot explain :/

When I remove that loop, we, as fully expected, loose 1 branch, but the
cycle count for the branch-cold case shoots up. Must be something GCC
does.


EVENT=0 -DNEW=1 -DFLS=1
event: 19.626050 +- 0.038995
EVENT=0 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
event: 109.610670 +- 0.425667

EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1
event: 21.445680 +- 0.043782
EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1
event: 83.590420 +- 0.142126


EVENT=4 -DNEW=1 -DFLS=1
event: 20.252330 +- 0.005265
EVENT=4 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
event: 20.252340 +- 0.005265

EVENT=4 -DNEW=1 -DFLS=1 -DANSHUL=1
event: 21.252300 +- 0.005266
EVENT=4 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1
event: 21.252300 +- 0.005266


EVENT=5 -DNEW=1 -DFLS=1
event: 0.019370 +- 0.000732
EVENT=5 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
event: 3.665240 +- 0.005309

EVENT=5 -DNEW=1 -DFLS=1 -DANSHUL=1
event: 0.020150 +- 0.000755
EVENT=5 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1
event: 2.225330 +- 0.004875


Let me dig out another GCC version current:

  gcc (Debian 6.3.0-18) 6.3.0 20170516

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2017-07-21 11:40             ` Peter Zijlstra
@ 2017-07-21 12:15               ` Joe Perches
  2017-07-21 13:26                 ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Joe Perches @ 2017-07-21 12:15 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List,
	anshul.g, Thomas Gleixner, Ingo Molnar, Will Deacon

On Fri, 2017-07-21 at 13:40 +0200, Peter Zijlstra wrote:
> @@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x)
>  	if (x <= 1)
>  		return x;
>  
> -	m = 1UL << (BITS_PER_LONG - 2);
> +	m = 1UL << (__fls(x) & ~1U);
> +
> +	while (m > x)
> +		m >>= 2;

while (m > x) ?

Belt and suspenders if __fls is broken?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2017-07-20 23:24           ` Linus Torvalds
@ 2017-07-21 11:40             ` Peter Zijlstra
  2017-07-21 12:15               ` Joe Perches
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2017-07-21 11:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List,
	anshul.g, Thomas Gleixner, Joe Perches, Ingo Molnar, Will Deacon

[-- Attachment #1: Type: text/plain, Size: 4744 bytes --]

On Thu, Jul 20, 2017 at 04:24:32PM -0700, Linus Torvalds wrote:

> So mayube something like the attached?
> 
> NOTE NOTE NOTE! This may be pure garbage. It's untested. But the code
> generation looks ok even if gcc is being way too clever and turns that
> first loop into a counted loop instead. Whatever, maybe it's the right
> thing to do.

It passes a -DVALIDATE=1 run (up to a billion or so), so it seems to work.


I've made the test thing a bit more complicated in order to address your
concerns for primed branch predictors (and once the branch predictors
get wiped, the predictable input values don't matter anymore either I
think, although I could easily LFSR generate them as well of course).

Find it attached in a tarball.


The results, from running ./test.sh (left SKL, right SNB):

(values <100k)

Cycles:


EVENT=0 -DLINUS=1
event: 29.533080 +- 0.046261			36.800100 +- 0.046067
EVENT=0 -DLINUS=1 -DWIPE_BTB=1
event: 143.513440 +- 0.152651                   153.054640 +- 0.099403
EVENT=0 -DNEW=1 -DANSHUL=1
event: 41.457300 +- 0.048409                    55.046100 +- 0.044968
EVENT=0 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 128.351400 +- 0.366873                   161.183690 +- 0.132301
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 21.433170 +- 0.043859                    27.672700 +- 0.063982
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 79.850210 +- 0.133646                    112.422470 +- 0.101465
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 31.771680 +- 0.037655                    37.515160 +- 0.080305
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 129.673440 +- 0.430308                   125.514390 +- 0.117590


Branches:


EVENT=4 -DLINUS=1
event: 31.507740 +- 0.010470                    31.507710 +- 0.010470
EVENT=4 -DLINUS=1 -DWIPE_BTB=1
event: 31.507720 +- 0.010470                    31.507760 +- 0.010470
EVENT=4 -DNEW=1 -DANSHUL=1
event: 45.125510 +- 0.002696                    45.125510 +- 0.002696
EVENT=4 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 45.125490 +- 0.002696                    45.125540 +- 0.002697
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 21.252320 +- 0.005266                    21.252330 +- 0.005266
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 21.252320 +- 0.005266                    21.252340 +- 0.005266
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 25.727940 +- 0.005975                    25.727930 +- 0.005975
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 25.727940 +- 0.005975                    25.727930 +- 0.005975


Branch-Misses:


EVENT=5 -DLINUS=1
event: 0.022670 +- 0.000789                     0.020920 +- 0.000812
EVENT=5 -DLINUS=1 -DWIPE_BTB=1
event: 5.481750 +- 0.006930                     5.955130 +- 0.005228
EVENT=5 -DNEW=1 -DANSHUL=1
event: 0.025990 +- 0.000798                     0.017170 +- 0.000690
EVENT=5 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 3.343600 +- 0.004249                     5.723570 +- 0.004876
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 0.019040 +- 0.000731                     0.022570 +- 0.000829
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 2.156350 +- 0.004643                     4.297650 +- 0.004910
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 0.019800 +- 0.000747                     0.016780 +- 0.000688
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 4.481360 +- 0.005505                     4.518300 +- 0.004877




Which still doesn't explain your hatred of fls(), as even with cold
branches and a software fls, its faster than your latest proposal (as
can be explained by the lower total number of branches for the software
fls variant compared to your proposal).


And when we start to look at larger values, the fls() one pulls even
further ahead.

So I'll stick with my proposal... I can write up a proper Changelog and
maybe add my test to tools/testing/sqrt/ if you think its worth it.


---
 lib/int_sqrt.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index 1ef4cc344977..611933760225 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -7,12 +7,13 @@
 
 #include <linux/kernel.h>
 #include <linux/export.h>
+#include <linux/bitops.h>
 
 /**
- * int_sqrt - rough approximation to sqrt
+ * int_sqrt - Computes the integer square root
  * @x: integer of which to calculate the sqrt
  *
- * A very rough approximation to the sqrt() function.
+ * Computes: floor(sqrt(@x))
  */
 unsigned long int_sqrt(unsigned long x)
 {
@@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x)
 	if (x <= 1)
 		return x;
 
-	m = 1UL << (BITS_PER_LONG - 2);
+	m = 1UL << (__fls(x) & ~1U);
+
+	while (m > x)
+		m >>= 2;
+
 	while (m != 0) {
 		b = y + m;
 		y >>= 1;

[-- Attachment #2: sqrt.tar --]
[-- Type: application/x-tar, Size: 20480 bytes --]

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2017-07-20 22:34         ` Peter Zijlstra
@ 2017-07-20 23:24           ` Linus Torvalds
  2017-07-21 11:40             ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2017-07-20 23:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List,
	anshul.g, Thomas Gleixner, Joe Perches

[-- Attachment #1: Type: text/plain, Size: 1777 bytes --]

On Thu, Jul 20, 2017 at 3:34 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>  "Why does this matter, and what is the range it matters for?"
>
> I was looking to do some work on the idle estimator. Parts of that keep
> online avg and variance for normal distributions. I wanted to bias the
> avg downwards, the way to do that is to subtract a scaled stdev from it.
> Computing the stdev from a variance requires the sqrt.
>
> In any case, I suppose the range of values would be in the order of
> TICK_NSEC, so the variance would be a number of orders below that. So
> we're looking at fairly small numbers <1e5.

Ok. So that already cuts down the problem space a lot.

And yes, the current code is very suboptimal for exactly the small number case.

It's also typiocally the case that right-shifting is more expensive
than left-shifting, so the fact that we left-shift twice in that loop
for all the big values is extra expensive.

So I would actually suggest a different approach:

 - start with a small 'm'

 - and grow it up.

The "small m" means that there is a small constant, which is good.

And growing it up is a single trivial left shift by two, which can
also be done just two adds or as a single lea, so it works fine even
on architectures with slow shifters etc.

So mayube something like the attached?

NOTE NOTE NOTE! This may be pure garbage. It's untested. But the code
generation looks ok even if gcc is being way too clever and turns that
first loop into a counted loop instead. Whatever, maybe it's the right
thing to do.

But note that I might have broken the sqrt for some case, so you need
to double-check my logic there. The *intention* is that it's the exact
same thing as it used to do, just initializing 'm' differently.

                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 745 bytes --]

 lib/int_sqrt.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index 1ef4cc344977..da3b3dabad8e 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -16,14 +16,22 @@
  */
 unsigned long int_sqrt(unsigned long x)
 {
-	unsigned long b, m, y = 0;
+	unsigned long m, y;
 
 	if (x <= 1)
 		return x;
 
-	m = 1UL << (BITS_PER_LONG - 2);
-	while (m != 0) {
-		b = y + m;
+	m = 64;
+	do {
+		unsigned long new_m = m << 2;
+		if (!new_m)
+			break;
+		m = new_m;
+	} while (m < x);
+
+	y = 0;
+	do {
+		unsigned long b = y + m;
 		y >>= 1;
 
 		if (x >= b) {
@@ -31,7 +39,7 @@ unsigned long int_sqrt(unsigned long x)
 			y += m;
 		}
 		m >>= 2;
-	}
+	} while (m);
 
 	return y;
 }

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2017-07-20 18:31       ` Linus Torvalds
@ 2017-07-20 22:34         ` Peter Zijlstra
  2017-07-20 23:24           ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2017-07-20 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List,
	anshul.g, Thomas Gleixner, Joe Perches

On Thu, Jul 20, 2017 at 11:31:36AM -0700, Linus Torvalds wrote:
> How did this two-year old thread get resurrected?

I was looking for the original thread doing that 'optimization'
Davidlohr did but found this first.

> And the *most* important question is that first one:
> 
>  "Why does this matter, and what is the range it matters for?"

I was looking to do some work on the idle estimator. Parts of that keep
online avg and variance for normal distributions. I wanted to bias the
avg downwards, the way to do that is to subtract a scaled stdev from it.
Computing the stdev from a variance requires the sqrt.

Thomas rightly asked how expensive our sqrt is, I found Davidlohr's
commit message and got confused by the numbers, so I reran them and
found the optimization did the reverse, it made things worse.

By then I was prodding at it... 'fun' problem :-)


In any case, I suppose the range of values would be in the order of
TICK_NSEC, so the variance would be a number of orders below that. So
we're looking at fairly small numbers <1e5.

> Also, since this is a generic library routine, no way can we depend on
> fls being fast.

Which is why I also tested the software fls, but you're right in that
the microbench primes the branch predictor. Still, the software fls is 6
branches, whereas the 'missing' loop:

  while (m > x)
    m >>= 2;

would need up to 30 or so cycles worst case. So even in that respect it
makes sense its a 'win', esp. so for small numbers.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2017-07-20 11:24     ` Peter Zijlstra
  2017-07-20 11:52       ` Joe Perches
  2017-07-20 15:27       ` Peter Zijlstra
@ 2017-07-20 18:31       ` Linus Torvalds
  2017-07-20 22:34         ` Peter Zijlstra
  2 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2017-07-20 18:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List,
	anshul.g, Thomas Gleixner, Joe Perches

How did this two-year old thread get resurrected?

Anyway, it got resurrected without even answering one core question:

On Thu, Jul 20, 2017 at 4:24 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Feb 02, 2015 at 11:13:44AM -0800, Linus Torvalds wrote:
>>>> On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> >
>> > (I'm also not entirely sure what uses int_sqrt() that ends up being so
>> > performance-critical, so it would be good to document that too, since
>> > that probably also matters for the "what's the normal argument range"
>> > question..)

This is still the case. Which of the (very few) users really _care_?
And what are the normal values for that?

For example, the 802.11 minstrel code does a "MINSTREL_TRUNC()" on a
"unsigned int" value that is in millions. It's not even "unsigned
long", so we know it's not many thousands of millions, and
MINSTREL_TRUNC shifts it down by 12 bits.

So we know we have at most a 20-bit argument.

The one case that uses actual unsigned long seems to be
"slow_is_prime_number()", and honestly, the sqrt() is the *least* of
our problems there.

There's a few drivers and filesystems that use it. I do not believe
performance matters in those cases. Even if you do a "int_sqrt()" per
nertwork packet on some high-performance network (and none of them
look anything like that).

And there's a couple of VM users. They don't look particularly critical either.

So why do you care? Because honestly, calling int_sqrt() once in a
blue moon with caches cold and no branch prediction information tends
to have very different performance characteristics from calling it in
a loop with very predictable input.

So I think your "benchmark" is just garbage, in that it's testing
something entirely different than the actual load.

Also, since this is a generic library routine, no way can we depend on
fls being fast.

But we could certainly improve on the initial value a lot. It's just
that we should probably strive to improve on it without adding extra
branch misprediction or I$ misses - both things that your benchmark
isn't actually testing at all, since it does the exact opposite of
that by basically preloading both.

And the *most* important question is that first one:

 "Why does this matter, and what is the range it matters for?"

                  Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2017-07-20 11:24     ` Peter Zijlstra
  2017-07-20 11:52       ` Joe Perches
@ 2017-07-20 15:27       ` Peter Zijlstra
  2017-07-20 18:31       ` Linus Torvalds
  2 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2017-07-20 15:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List,
	anshul.g, Thomas Gleixner, joe

On Thu, Jul 20, 2017 at 01:24:49PM +0200, Peter Zijlstra wrote:
> ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1  ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt
> 
>  Performance counter stats for './sqrt' (10 runs):
> 
>        328,415,775      cycles:u                                                      ( +-  0.15% )
>      1,138,579,704      instructions:u            #    3.47  insn per cycle           ( +-  0.00% )
> 
>        0.088703205 seconds time elapsed  

> static __always_inline unsigned long fls(unsigned long word)
> {
> 	asm("rep; bsr %1,%0"
> 		: "=r" (word)
> 		: "rm" (word));
> 	return BITS_PER_LONG - 1 - word;
> }

That is actually "lzcnt", if I used the regular fls implementation:


static __always_inline unsigned long __fls(unsigned long word)
{
	asm("bsr %1,%0"
		: "=r" (word)
		: "rm" (word));
	return word;
}

It ends up slightly more expensive:

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1  ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

 Performance counter stats for './sqrt' (10 runs):

       384,842,215      cycles:u                                                      ( +-  0.08% )
     1,118,579,712      instructions:u            #    2.91  insn per cycle           ( +-  0.00% )

       0.103018001 seconds time elapsed  


Still loads cheaper than pretty much any other combination.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2017-07-20 11:52       ` Joe Perches
@ 2017-07-20 14:13         ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2017-07-20 14:13 UTC (permalink / raw)
  To: Joe Perches
  Cc: Linus Torvalds, Anshul Garg, Davidlohr Bueso,
	Linux Kernel Mailing List, anshul.g, Thomas Gleixner

On Thu, Jul 20, 2017 at 04:52:49AM -0700, Joe Perches wrote:
> On Thu, 2017-07-20 at 13:24 +0200, Peter Zijlstra wrote:
> []
> > +	m = 1UL << ((fls(x) + 1) & ~1UL);
> 
> maybe 
> 
> #if BITS_PER_LONG == 64
> 	m = 1UL << ((fls64(x) + 1) & ~1UL);
> #else
> 	m = 1UL << ((fls(x) + 1) & ~1UL);
> #endif

We do seem to have __fls() which is defined on long.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2017-07-20 11:24     ` Peter Zijlstra
@ 2017-07-20 11:52       ` Joe Perches
  2017-07-20 14:13         ` Peter Zijlstra
  2017-07-20 15:27       ` Peter Zijlstra
  2017-07-20 18:31       ` Linus Torvalds
  2 siblings, 1 reply; 27+ messages in thread
From: Joe Perches @ 2017-07-20 11:52 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List,
	anshul.g, Thomas Gleixner

On Thu, 2017-07-20 at 13:24 +0200, Peter Zijlstra wrote:
[]
> +	m = 1UL << ((fls(x) + 1) & ~1UL);

maybe 

#if BITS_PER_LONG == 64
	m = 1UL << ((fls64(x) + 1) & ~1UL);
#else
	m = 1UL << ((fls(x) + 1) & ~1UL);
#endif

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-02 19:13   ` Linus Torvalds
  2015-02-03  4:41     ` Davidlohr Bueso
@ 2017-07-20 11:24     ` Peter Zijlstra
  2017-07-20 11:52       ` Joe Perches
                         ` (2 more replies)
  1 sibling, 3 replies; 27+ messages in thread
From: Peter Zijlstra @ 2017-07-20 11:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Anshul Garg, Davidlohr Bueso, Linux Kernel Mailing List,
	anshul.g, Thomas Gleixner, joe

On Mon, Feb 02, 2015 at 11:13:44AM -0800, Linus Torvalds wrote:
> On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > (I'm also not entirely sure what uses int_sqrt() that ends up being so
> > performance-critical, so it would be good to document that too, since
> > that probably also matters for the "what's the normal argument range"
> > question..)
> 
> ... it's also not entirely clear that we need a whole new loop. We
> might just instead start off with a better guess for 'm' using some
> calculation that might be doable with a single conditional move
> instruction instead of a loop. Because I suspect that the inevitable
> branch misprediction of a new loop is likely as expensive as a few
> iterations through the core one.
> 
> IOW, instead of
> 
>   m = 1UL << (BITS_PER_LONG - 2);
> 
> perhaps something like
> 
>    m = 1UL << (BITS_PER_LONG/2- 2);
>    if (m < x)
>        m <<= BITS_PER_LONG/2;
> 
> (assuming gcc can change that code into a "cmov") might cut down the
> "lots of empty loops" case in half for small values of 'x'.
> 
> There's probably some other better cheap initial guess value estimator.

So I had a wee look at int_sqrt(), and yes Davidlohr's patch makes it
worse for my machine (Xeon E3v5 aka. fancy Skylake).

As in _MUCH_ worse..

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1  ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

 Performance counter stats for './sqrt' (10 runs):

       828,233,295      cycles:u                                                      ( +-  0.04% )
     2,480,095,771      instructions:u            #    2.99  insn per cycle           ( +-  0.00% )

       0.220202758 seconds time elapsed                                          ( +-  0.18% )

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000   ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

 Performance counter stats for './sqrt' (10 runs):

       603,809,551      cycles:u                                                      ( +-  1.30% )
     1,677,726,591      instructions:u            #    2.78  insn per cycle           ( +-  0.00% )

       0.160801044 seconds time elapsed  

That is, we went from ~60 cycles to ~82 cycles per invocation.


Now, the patches proposed in this thread do indeed improve matters but
seem to not have merged up until this day:


~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DANSHUL=1  ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

 Performance counter stats for './sqrt' (10 runs):

       598,827,196      cycles:u                                                      ( +-  0.18% )
     1,697,726,381      instructions:u            #    2.84  insn per cycle           ( +-  0.00% )

       0.159880738 seconds time elapsed                                          ( +-  0.36% )

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DLINUS=1  ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

 Performance counter stats for './sqrt' (10 runs):

       601,463,184      cycles:u                                                      ( +-  0.07% )
     1,380,095,976      instructions:u            #    2.29  insn per cycle           ( +-  0.00% )

       0.160788095 seconds time elapsed                                          ( +-  0.55% )

Which basically bring us back to the old performance. So I would
strongly suggest we merge one.


Now, the thing is, Intel CPUs are actually fairly good at fls(), so I
gave Joe's suggestions a spin too..

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1  ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

 Performance counter stats for './sqrt' (10 runs):

       390,420,004      cycles:u                                                      ( +-  0.07% )
     1,141,802,493      instructions:u            #    2.92  insn per cycle           ( +-  0.00% )

       0.105480000 seconds time elapsed                                          ( +-  0.64% )

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1  ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

 Performance counter stats for './sqrt' (10 runs):

       328,415,775      cycles:u                                                      ( +-  0.15% )
     1,138,579,704      instructions:u            #    3.47  insn per cycle           ( +-  0.00% )

       0.088703205 seconds time elapsed  

Which gets us a real improvement...

Now even with the software fls() fallback from
include/asm-generic/bitops/fls.h there is a significant improvement:


~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 -DSOFTFLS=1  ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

 Performance counter stats for './sqrt' (10 runs):

       375,633,931      cycles:u                                                      ( +-  0.13% )
     1,289,700,013      instructions:u            #    3.43  insn per cycle           ( +-  0.00% )

       0.100596211 seconds time elapsed 


Also, I ran with -DVALIDATE=1 and while the comment says its a 'rough'
estimate of the sqrt() I did not in fact find any value for which we
deviate < INT_MAX.




----------------------[ sqrt.c ]---------------------------

#include <math.h>
#include <stdio.h>

#define BITS_PER_LONG	(sizeof(long) * 8)

#ifndef LOOPS
#define LOOPS 1000000
#endif

#ifdef SOFTFLS

static __always_inline unsigned long fls(unsigned long x)
{
	int r = 32;

	if (!x)
		return 0;

	if (!(x & 0xffff0000u)) {
		x <<= 16;
		r -= 16;
	}
	if (!(x & 0xff000000u)) {
		x <<= 8;
		r -= 8;
	}
	if (!(x & 0xf0000000u)) {
		x <<= 4;
		r -= 4;
	}
	if (!(x & 0xc0000000u)) {
		x <<= 2;
		r -= 2;
	}
	if (!(x & 0x80000000u)) {
		x <<= 1;
		r -= 1;
	}
	return r;
}

#else

static __always_inline unsigned long fls(unsigned long word)
{
	asm("rep; bsr %1,%0"
		: "=r" (word)
		: "rm" (word));
	return BITS_PER_LONG - 1 - word;
}

#endif


#ifdef NEW

unsigned long __attribute__((noinline)) int_sqrt(unsigned long x)
{
	unsigned long b, m, y = 0;

	if (x <= 1)
		return x;

#ifdef LINUS
	m = 1UL << (BITS_PER_LONG/2 - 2);
	if (m < x)
		m <<= BITS_PER_LONG/2;

#else

#ifdef FLS
	m = 1UL << ((fls(x) + 1) & ~1UL);
#else
	m = 1UL << (BITS_PER_LONG - 2);
#endif

#ifdef ANSHUL
	while (m > x)
		m >>= 2;
#endif
#endif

	while (m != 0) {
		b = y + m;
		y >>= 1;

		if (x >= b) {
			x -= b;
			y += m;
		}
		m >>= 2;
	}

	return y;
}

#else

unsigned long __attribute__((noinline)) int_sqrt(unsigned long x)
{
	unsigned long op, res, one;

	op = x;
	res = 0;

	one = 1UL << (BITS_PER_LONG - 2);
	while (one > op)
		one >>= 2;

	while (one != 0) {
		if (op >= res + one) {
			op = op - (res + one);
			res = res +  2 * one;
		}
		res /= 2;
		one /= 4;
	}
	return res;
}

#endif

void main(void)
{
	unsigned long i;

	for (i=0; i<LOOPS; i++) {
		unsigned long a = int_sqrt(i);
		asm volatile("" : : "r" (a));

#ifdef VALIDATE
		{
			unsigned long b = floor(sqrt(i));
			if (a != b)
				printf("%ld %ld %ld\n", i, a, b);
		}
#endif

	}
}

--------------

The only caveat seems to be that our fls() is only defined for 'int' not
long :/


---
 lib/int_sqrt.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index 1ef4cc344977..87c3aa360441 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -7,12 +7,11 @@
 
 #include <linux/kernel.h>
 #include <linux/export.h>
+#include <linux/bitops.h>
 
 /**
- * int_sqrt - rough approximation to sqrt
+ * int_sqrt - approximation to sqrt
  * @x: integer of which to calculate the sqrt
- *
- * A very rough approximation to the sqrt() function.
  */
 unsigned long int_sqrt(unsigned long x)
 {
@@ -21,7 +20,11 @@ unsigned long int_sqrt(unsigned long x)
 	if (x <= 1)
 		return x;
 
-	m = 1UL << (BITS_PER_LONG - 2);
+	m = 1UL << ((fls(x) + 1) & ~1UL);
+
+	while (m > x)
+		m >>= 2;
+
 	while (m != 0) {
 		b = y + m;
 		y >>= 1;

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-05 19:37             ` Linus Torvalds
@ 2015-02-08 15:39               ` Anshul Garg
  0 siblings, 0 replies; 27+ messages in thread
From: Anshul Garg @ 2015-02-08 15:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Davidlohr Bueso, Linux Kernel Mailing List, anshul.g

Dear Mr. linus,

Thanks for quick replies.

Yes performance numbers are not conclusive enough.
So its better to discard this patch as of now.

I will try to explore more in this area.


Thanks & regards
Anshul Garg



On Fri, Feb 6, 2015 at 1:07 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Feb 5, 2015 at 10:43 AM, Anshul Garg <aksgarg1989@gmail.com> wrote:
>>
>> NOTE ::
>> I have not used gcc optimizations while compilation.
>> With O2 level optimization proposed solution is taking more time.
>
> The thing is, the kernel is compiled with -O2, so that's what matters.
>
> Also, for very tight loops like this, the major costs tend to be very
> subtle microarchitectural details, particularly branch prediction.
> Which in turn end up sometimes depending on just exactly where the
> branches were placed, and even whether two conditional branches were
> in the same 8-byte aligned region etc things (because the branch
> prediction might be done ignoring the low bits of the EIP etc). So not
> only does the exact microarchitecture matter, things that don't *seem*
> like they should matter can change behavior a lot.
>
> My point is really that the performance numbers are very ambiguous.
> The patch may well help in some situations, but hurt in others.
>
>                       Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-05 18:43           ` Anshul Garg
@ 2015-02-05 19:37             ` Linus Torvalds
  2015-02-08 15:39               ` Anshul Garg
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2015-02-05 19:37 UTC (permalink / raw)
  To: Anshul Garg; +Cc: Davidlohr Bueso, Linux Kernel Mailing List, anshul.g

On Thu, Feb 5, 2015 at 10:43 AM, Anshul Garg <aksgarg1989@gmail.com> wrote:
>
> NOTE ::
> I have not used gcc optimizations while compilation.
> With O2 level optimization proposed solution is taking more time.

The thing is, the kernel is compiled with -O2, so that's what matters.

Also, for very tight loops like this, the major costs tend to be very
subtle microarchitectural details, particularly branch prediction.
Which in turn end up sometimes depending on just exactly where the
branches were placed, and even whether two conditional branches were
in the same 8-byte aligned region etc things (because the branch
prediction might be done ignoring the low bits of the EIP etc). So not
only does the exact microarchitecture matter, things that don't *seem*
like they should matter can change behavior a lot.

My point is really that the performance numbers are very ambiguous.
The patch may well help in some situations, but hurt in others.

                      Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-05 18:33         ` Linus Torvalds
@ 2015-02-05 18:43           ` Anshul Garg
  2015-02-05 19:37             ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Anshul Garg @ 2015-02-05 18:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Davidlohr Bueso, Linux Kernel Mailing List, anshul.g

[-- Attachment #1: Type: text/plain, Size: 1134 bytes --]

On Thu, Feb 5, 2015 at 10:33 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Feb 5, 2015 at 10:20 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Hmm. I did that too [..]
>
> Side note: one difference in our results (apart from possibly just CPU
> uarch details) is that my loop goes to 100M to make it easier to just
> time it. Which means that my load essentially had about three more
> iterations over nonzero data.
>
>                          Linus

I have also done the same testing on 100 million numbers.
Attaching source codes.

Below is the result ::

int_sqrt_old - current
int_sqrt_new -  With proposed change

anshul@ubuntu:~/kernel_latest/sqrt$ time ./int_sqrt_old

real    0m41.895s
user    0m36.490s
sys    0m0.365s

anshul@ubuntu:~/kernel_latest/sqrt$ time ./int_sqrt_new

real    0m39.491s
user    0m36.495s
sys    0m0.338s


I have run this test on Intel(R) Core(TM) i3-4000M CPU @ 2.40GHz VMWare Machine.
Please check if i am doing anything wrong.

NOTE ::
I have not used gcc optimizations while compilation.
With O2 level optimization proposed solution is taking more time.

[-- Attachment #2: int_sqrt_old.c --]
[-- Type: text/x-csrc, Size: 721 bytes --]

/*
 * Copyright (C) 2013 Davidlohr Bueso <davidlohr.bueso@hp.com>
 *
 *  Based on the shift-and-subtract algorithm for computing integer
 *  square root from Guy L. Steele.
 */

#include <stdio.h>
/**
 * int_sqrt - rough approximation to sqrt
 * @x: integer of which to calculate the sqrt
 *
 * A very rough approximation to the sqrt() function.
 */
#define BITS_PER_LONG (8*sizeof(long))

unsigned long int_sqrt(unsigned long x)
{
	unsigned long b, m, y = 0;

	if (x <= 1)
		return x;

	m = 1UL << (BITS_PER_LONG - 2);
	while (m != 0) {
		b = y + m;
		y >>= 1;

		if (x >= b) {
			x -= b;
			y += m;
		}
		m >>= 2;
	}

	return y;
}

int main()
{
	unsigned long n = 1;
	for(;n<=100000000;++n)
		int_sqrt(n);
	return 0;
}

[-- Attachment #3: int_sqrt_new.c --]
[-- Type: text/x-csrc, Size: 750 bytes --]

/*
 * Copyright (C) 2013 Davidlohr Bueso <davidlohr.bueso@hp.com>
 *
 *  Based on the shift-and-subtract algorithm for computing integer
 *  square root from Guy L. Steele.
 */

#include <stdio.h>

#define BITS_PER_LONG (8*sizeof(long))

/**
 * int_sqrt - rough approximation to sqrt
 * @x: integer of which to calculate the sqrt
 *
 * A very rough approximation to the sqrt() function.
 */
unsigned long int_sqrt(unsigned long x)
{
	unsigned long b, m, y = 0;

	if (x <= 1)
		return x;

	m = 1UL << (BITS_PER_LONG - 2);

	while(m > x )
		m >>= 2;

	while (m != 0) {
		b = y + m;
		y >>= 1;

		if (x >= b) {
			x -= b;
			y += m;
		}
		m >>= 2;
	}

	return y;
}

int main()
{
	unsigned long n = 1;
	for(;n<=100000000;++n)
		int_sqrt(n);
	return 0;
}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-05 18:20       ` Linus Torvalds
@ 2015-02-05 18:33         ` Linus Torvalds
  2015-02-05 18:43           ` Anshul Garg
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2015-02-05 18:33 UTC (permalink / raw)
  To: Anshul Garg; +Cc: Davidlohr Bueso, Linux Kernel Mailing List, anshul.g

On Thu, Feb 5, 2015 at 10:20 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. I did that too [..]

Side note: one difference in our results (apart from possibly just CPU
uarch details) is that my loop goes to 100M to make it easier to just
time it. Which means that my load essentially had about three more
iterations over nonzero data.

                         Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-03 15:42     ` Anshul Garg
@ 2015-02-05 18:20       ` Linus Torvalds
  2015-02-05 18:33         ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2015-02-05 18:20 UTC (permalink / raw)
  To: Anshul Garg; +Cc: Davidlohr Bueso, Linux Kernel Mailing List, anshul.g

[-- Attachment #1: Type: text/plain, Size: 1328 bytes --]

On Tue, Feb 3, 2015 at 7:42 AM, Anshul Garg <aksgarg1989@gmail.com> wrote:
>
> I have done profiling of int_sqrt function using perf tool for 10 times.
> For this purpose i have created a userspace program which uses sqrt function
> from 1 to a million.

Hmm. I did that too, and it doesn't improve things for me. In fact, it
makes it slower.

    [torvalds@i7 ~]$ gcc -Wall -O2 -DREDUCE int_sqrt.c ; time ./a.out

    real 0m2.098s
    user 0m2.095s
    sys 0m0.000s

    [torvalds@i7 ~]$ gcc -Wall -O2 int_sqrt.c ; time ./a.out

    real 0m1.886s
    user 0m1.883s
    sys 0m0.000s

and the profile shows that 35% of the time is spent on that branch
back of the initial reduction loop.

In contrast, my suggested "reduce just once" does seem to improve things:

    [torvalds@i7 ~]$ gcc -Wall -O2 -DONCE int_sqrt.c ; time ./a.out

    real 0m1.436s
    user 0m1.434s
    sys 0m0.000s

but it's kind of hacky.

NOTE! This probably depends a lot on microarchitecture details,
including very much branch predictor etc. And I didn't actually check
that it gives the right result, but I do think that this optimization
needs to be looked at more if we want to do it.

I was running this on an i7-4770S, fwiw.

Attached is the stupid test-program I used to do the above. Maybe I
did something wrong.

                         Linus

[-- Attachment #2: int_sqrt.c --]
[-- Type: text/x-csrc, Size: 951 bytes --]

/*
 * Copyright (C) 2013 Davidlohr Bueso <davidlohr.bueso@hp.com>
 *
 *  Based on the shift-and-subtract algorithm for computing integer
 *  square root from Guy L. Steele.
 */

#define BITS_PER_LONG (8*sizeof(long))

/**
 * int_sqrt - rough approximation to sqrt
 * @x: integer of which to calculate the sqrt
 *
 * A very rough approximation to the sqrt() function.
 */
unsigned long __attribute__((noinline)) int_sqrt(unsigned long x)
{
	unsigned long b, m, y = 0;

	if (x <= 1)
		return x;

	m = 1UL << (BITS_PER_LONG - 2);

#ifdef REDUCE
	while (m > x)
		m >>= 2;
#elif defined(ONCE)
	{
		unsigned long n = m >> (BITS_PER_LONG/2);
		m = (n >= x) ? n : m;
	}
#endif
	while (m != 0) {
		b = y + m;
		y >>= 1;

		if (x >= b) {
			x -= b;
			y += m;
		}
		m >>= 2;
	}

	return y;
}

int main(int argc, char **argv)
{
	unsigned long i;

	for (i = 0; i < 100000000; i++) {
		unsigned long a = int_sqrt(i);
		asm volatile("": : "r" (a));
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-03  4:41     ` Davidlohr Bueso
  2015-02-03  4:57       ` Davidlohr Bueso
@ 2015-02-03 15:54       ` Anshul Garg
  1 sibling, 0 replies; 27+ messages in thread
From: Anshul Garg @ 2015-02-03 15:54 UTC (permalink / raw)
  To: Davidlohr Bueso; +Cc: Linus Torvalds, Linux Kernel Mailing List, anshul.g

On Tue, Feb 3, 2015 at 10:11 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> On Mon, 2015-02-02 at 11:13 -0800, Linus Torvalds wrote:
>> IOW, instead of
>>
>>   m = 1UL << (BITS_PER_LONG - 2);
>>
>> perhaps something like
>>
>>    m = 1UL << (BITS_PER_LONG/2- 2);
>>    if (m < x)
>>        m <<= BITS_PER_LONG/2;

On doing this change also extra instruction(shown below) will execute
if x is small(say 10000).

b = y + m;
y >>= 1;
if (x >= b) {

Although i understand your point that we can set value of m according
to range of x.
But in future also range of x can change if apart from memory some
other module starts
using this function more extensively.
So in worst case it will be almost same as current implementation.

So i think its better to scale m according to x by doing
while(m > x)
m>>=2;
rather than according to range of x.

Thanks
Anshul Garg


>>
>> (assuming gcc can change that code into a "cmov") might cut down the
>> "lots of empty loops" case in half for small values of 'x'.
>
> Makes a lot of sense.
>

>> There's probably some other better cheap initial guess value estimator.
>
> Just to get a feeling for the normal arg range in the non-driver parts
> that use this thing:
>
> fs/ceph/super.h:        congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10);
> fs/nfs/write.c: nfs_congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10);
> fs/nfsd/nfscache.c:     limit = (16 * int_sqrt(low_pages)) << (PAGE_SHIFT-10);
> kernel/rcu/tree_plugin.h:               ls = int_sqrt(nr_cpu_ids);
> mm/memcontrol.c:                inactive_ratio = int_sqrt(10 * gb);
> mm/page_alloc.c:                ratio = int_sqrt(10 * gb);
> mm/page_alloc.c:        new_min_free_kbytes = int_sqrt(lowmem_kbytes * 16);
>
> So mostly values that scale according to mem or cpu.
>
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-03  4:47   ` Davidlohr Bueso
@ 2015-02-03 15:42     ` Anshul Garg
  2015-02-05 18:20       ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Anshul Garg @ 2015-02-03 15:42 UTC (permalink / raw)
  To: Davidlohr Bueso; +Cc: Linus Torvalds, Linux Kernel Mailing List, anshul.g

On Tue, Feb 3, 2015 at 10:17 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> On Mon, 2015-02-02 at 11:00 -0800, Linus Torvalds wrote:
>> Hmm. I don't disagree, but would like some more feedback.
>>
>> Davidlohr - you were the person to touch this function last (commit
>> 30493cc9dddb: "lib/int_sqrt.c: optimize square root algorithm"), and
>> you did so for performance reasons. And in fact, when you did that,
>> you removed that initial loop:
>>
>> -       one = 1UL << (BITS_PER_LONG - 2);
>> -       while (one > op)
>> -               one >>= 2;
>>
>> but I'm not sure that was actually all that conscious, I think the
>> real optimization was the changes inside the loop to make the final
>> real loop faster and simpler.
>
> I missed that. And yes, the real optimization should be in the loop.
>
>>
>> Also, you had performance numbers, so presumably a test harness for it
>> all. It probably depends a lot on the actual distribution of argument
>> values, of course, but it would be good to accompany the patch with
>> actual real numbers like lasty time.
>
> Aha. In my case I recall I ran a usersapce program using each function
> from 1 to a million, and throwing perf at it for 10 times.
>
I have done profiling of int_sqrt function using perf tool for 10 times.
For this purpose i have created a userspace program which uses sqrt function
from 1 to a million.

int_sqrt_old -> current algorithm version
int_sqrt_new -> with proposed change

these results are for BITS_PER_LONG=64.

Performance counter stats for './int_sqrt_old' (10 runs):

        460.944061 task-clock (msec)         #    0.969 CPUs utilized
( +-  1.72% )
64 context-switches          #    0.139 K/sec ( +-  2.27% )
                 0 cpu-migrations            #    0.000 K/sec
               132 page-faults               #    0.286 K/sec
   <not supported> cycles
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
   <not supported> instructions
   <not supported> branches
   <not supported> branch-misses

       0.475795341 seconds time elapsed( +-  3.20% )



 Performance counter stats for './int_sqrt_new' (10 runs):

        401.782119 task-clock (msec)         #    0.974 CPUs utilized(
+-  1.55% )
                57 context-switches          #    0.141 K/sec( +-  1.92% )
                 0 cpu-migrations            #    0.000 K/sec
               132 page-faults               #    0.329 K/sec
   <not supported> cycles
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
   <not supported> instructions
   <not supported> branches
   <not supported> branch-misses

       0.412593296 seconds time elapsed( +-  2.03% )

As per profiling definitely there is improvement in algorithm timing.

>> (I'm also not entirely sure what uses int_sqrt() that ends up being so
>> performance-critical, so it would be good to document that too, since
>> that probably also matters for the "what's the normal argument range"
>> question..)
>
> It's not a big deal afaik.
>
> Thanks,
> Davidlohr
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-03  4:41     ` Davidlohr Bueso
@ 2015-02-03  4:57       ` Davidlohr Bueso
  2015-02-03 15:54       ` Anshul Garg
  1 sibling, 0 replies; 27+ messages in thread
From: Davidlohr Bueso @ 2015-02-03  4:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Anshul Garg, Linux Kernel Mailing List, anshul.g

On Mon, 2015-02-02 at 20:41 -0800, Davidlohr Bueso wrote:
> On Mon, 2015-02-02 at 11:13 -0800, Linus Torvalds wrote:
> > IOW, instead of
> > 
> >   m = 1UL << (BITS_PER_LONG - 2);
> > 
> > perhaps something like
> > 
> >    m = 1UL << (BITS_PER_LONG/2- 2);
> >    if (m < x)
> >        m <<= BITS_PER_LONG/2;

Assuming small values mostly, we could also try more aggressive
estimators for BITS_PER_LONG == 64. But then again, it probably doesn't
matter.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-02 19:00 ` Linus Torvalds
  2015-02-02 19:13   ` Linus Torvalds
@ 2015-02-03  4:47   ` Davidlohr Bueso
  2015-02-03 15:42     ` Anshul Garg
  1 sibling, 1 reply; 27+ messages in thread
From: Davidlohr Bueso @ 2015-02-03  4:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Anshul Garg, Linux Kernel Mailing List, anshul.g

On Mon, 2015-02-02 at 11:00 -0800, Linus Torvalds wrote:
> Hmm. I don't disagree, but would like some more feedback.
> 
> Davidlohr - you were the person to touch this function last (commit
> 30493cc9dddb: "lib/int_sqrt.c: optimize square root algorithm"), and
> you did so for performance reasons. And in fact, when you did that,
> you removed that initial loop:
> 
> -       one = 1UL << (BITS_PER_LONG - 2);
> -       while (one > op)
> -               one >>= 2;
> 
> but I'm not sure that was actually all that conscious, I think the
> real optimization was the changes inside the loop to make the final
> real loop faster and simpler.

I missed that. And yes, the real optimization should be in the loop.

> 
> Also, you had performance numbers, so presumably a test harness for it
> all. It probably depends a lot on the actual distribution of argument
> values, of course, but it would be good to accompany the patch with
> actual real numbers like lasty time.

Aha. In my case I recall I ran a usersapce program using each function
from 1 to a million, and throwing perf at it for 10 times.

> (I'm also not entirely sure what uses int_sqrt() that ends up being so
> performance-critical, so it would be good to document that too, since
> that probably also matters for the "what's the normal argument range"
> question..)

It's not a big deal afaik.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-02 19:13   ` Linus Torvalds
@ 2015-02-03  4:41     ` Davidlohr Bueso
  2015-02-03  4:57       ` Davidlohr Bueso
  2015-02-03 15:54       ` Anshul Garg
  2017-07-20 11:24     ` Peter Zijlstra
  1 sibling, 2 replies; 27+ messages in thread
From: Davidlohr Bueso @ 2015-02-03  4:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Anshul Garg, Linux Kernel Mailing List, anshul.g

On Mon, 2015-02-02 at 11:13 -0800, Linus Torvalds wrote:
> IOW, instead of
> 
>   m = 1UL << (BITS_PER_LONG - 2);
> 
> perhaps something like
> 
>    m = 1UL << (BITS_PER_LONG/2- 2);
>    if (m < x)
>        m <<= BITS_PER_LONG/2;
> 
> (assuming gcc can change that code into a "cmov") might cut down the
> "lots of empty loops" case in half for small values of 'x'.

Makes a lot of sense.

> There's probably some other better cheap initial guess value estimator.

Just to get a feeling for the normal arg range in the non-driver parts
that use this thing:

fs/ceph/super.h:        congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10);
fs/nfs/write.c: nfs_congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10);
fs/nfsd/nfscache.c:     limit = (16 * int_sqrt(low_pages)) << (PAGE_SHIFT-10);
kernel/rcu/tree_plugin.h:               ls = int_sqrt(nr_cpu_ids);
mm/memcontrol.c:                inactive_ratio = int_sqrt(10 * gb);
mm/page_alloc.c:                ratio = int_sqrt(10 * gb);
mm/page_alloc.c:        new_min_free_kbytes = int_sqrt(lowmem_kbytes * 16);

So mostly values that scale according to mem or cpu.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-02 19:10 ` Joe Perches
@ 2015-02-02 19:39   ` Linus Torvalds
  0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2015-02-02 19:39 UTC (permalink / raw)
  To: Joe Perches; +Cc: Anshul Garg, Linux Kernel Mailing List, anshul.g

On Mon, Feb 2, 2015 at 11:10 AM, Joe Perches <joe@perches.com> wrote:
>
> Perhaps removing the while and using fls(x) would be better.

No. fls is often slow, and you then do have to also round it down to
an even bit number etc, so it gets somewhat complex too.

We *probably* have some argument range that we care more about, which
is why I'd like to know what the profile is that triggered this
optimization, and what the common argument range is.

For example, one user is the cpu frequency governor, which seems to
try to predict the length of the next sleep. The values can probably
be anything (it has protections for overflowing into "long long"), but
presumably the onyl time we care about performance is when it's called
very very often, in which case I'm going out on a limb and saying that
the intervals will be short, and the standard deviation of those
intervals will also be small. So *maybe* that function really only
cares about performance when we have small arguments. I dunno.

                        Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-02 19:00 ` Linus Torvalds
@ 2015-02-02 19:13   ` Linus Torvalds
  2015-02-03  4:41     ` Davidlohr Bueso
  2017-07-20 11:24     ` Peter Zijlstra
  2015-02-03  4:47   ` Davidlohr Bueso
  1 sibling, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2015-02-02 19:13 UTC (permalink / raw)
  To: Anshul Garg, Davidlohr Bueso; +Cc: Linux Kernel Mailing List, anshul.g

On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> (I'm also not entirely sure what uses int_sqrt() that ends up being so
> performance-critical, so it would be good to document that too, since
> that probably also matters for the "what's the normal argument range"
> question..)

... it's also not entirely clear that we need a whole new loop. We
might just instead start off with a better guess for 'm' using some
calculation that might be doable with a single conditional move
instruction instead of a loop. Because I suspect that the inevitable
branch misprediction of a new loop is likely as expensive as a few
iterations through the core one.

IOW, instead of

  m = 1UL << (BITS_PER_LONG - 2);

perhaps something like

   m = 1UL << (BITS_PER_LONG/2- 2);
   if (m < x)
       m <<= BITS_PER_LONG/2;

(assuming gcc can change that code into a "cmov") might cut down the
"lots of empty loops" case in half for small values of 'x'.

There's probably some other better cheap initial guess value estimator.

                         Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-02 17:12 Anshul Garg
  2015-02-02 19:00 ` Linus Torvalds
@ 2015-02-02 19:10 ` Joe Perches
  2015-02-02 19:39   ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Joe Perches @ 2015-02-02 19:10 UTC (permalink / raw)
  To: Anshul Garg; +Cc: linux-kernel, anshul.g, torvalds

On Mon, 2015-02-02 at 09:12 -0800, Anshul Garg wrote:
> From: Anshul Garg <aksgarg1989@gmail.com>
> 
> Unnecessary instructions are executing even though m is
> greater than x so added logic to make m less than equal to
> x before performing these operations.
> 
> Signed-off-by: Anshul Garg <aksgarg1989@gmail.com>
> ---
>  lib/int_sqrt.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
[]
> @@ -22,6 +22,9 @@ unsigned long int_sqrt(unsigned long x)
>  		return x;
>  
>  	m = 1UL << (BITS_PER_LONG - 2);
> +
> +	while (m > x)
> +		m >>= 2;

Perhaps removing the while and using fls(x) would be better.

Perhaps something like:

	m = 1UL << min(sizeof(unsigned long) == sizeof(u64) ?
		       fls64(x) : fls(x),
		       BITS_PER_LONG - 2);

>  	while (m != 0) {
>  		b = y + m;
>  		y >>= 1;




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] lib/int_sqrt.c: Optimize square root function
  2015-02-02 17:12 Anshul Garg
@ 2015-02-02 19:00 ` Linus Torvalds
  2015-02-02 19:13   ` Linus Torvalds
  2015-02-03  4:47   ` Davidlohr Bueso
  2015-02-02 19:10 ` Joe Perches
  1 sibling, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2015-02-02 19:00 UTC (permalink / raw)
  To: Anshul Garg, Davidlohr Bueso; +Cc: Linux Kernel Mailing List, anshul.g

Hmm. I don't disagree, but would like some more feedback.

Davidlohr - you were the person to touch this function last (commit
30493cc9dddb: "lib/int_sqrt.c: optimize square root algorithm"), and
you did so for performance reasons. And in fact, when you did that,
you removed that initial loop:

-       one = 1UL << (BITS_PER_LONG - 2);
-       while (one > op)
-               one >>= 2;

but I'm not sure that was actually all that conscious, I think the
real optimization was the changes inside the loop to make the final
real loop faster and simpler.

Also, you had performance numbers, so presumably a test harness for it
all. It probably depends a lot on the actual distribution of argument
values, of course, but it would be good to accompany the patch with
actual real numbers like lasty time.

(I'm also not entirely sure what uses int_sqrt() that ends up being so
performance-critical, so it would be good to document that too, since
that probably also matters for the "what's the normal argument range"
question..)

                           Linus


On Mon, Feb 2, 2015 at 9:12 AM, Anshul Garg <aksgarg1989@gmail.com> wrote:
> From: Anshul Garg <aksgarg1989@gmail.com>
>
> Unnecessary instructions are executing even though m is
> greater than x so added logic to make m less than equal to
> x before performing these operations.
>
> Signed-off-by: Anshul Garg <aksgarg1989@gmail.com>
> ---
>  lib/int_sqrt.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
> index 1ef4cc3..64ae722 100644
> --- a/lib/int_sqrt.c
> +++ b/lib/int_sqrt.c
> @@ -22,6 +22,9 @@ unsigned long int_sqrt(unsigned long x)
>                 return x;
>
>         m = 1UL << (BITS_PER_LONG - 2);
> +
> +       while (m > x)
> +               m >>= 2;
>         while (m != 0) {
>                 b = y + m;
>                 y >>= 1;
> --
> 1.7.9.5
>
>
> ---
> This email has been checked for viruses by Avast antivirus software.
> http://www.avast.com
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] lib/int_sqrt.c: Optimize square root function
@ 2015-02-02 17:12 Anshul Garg
  2015-02-02 19:00 ` Linus Torvalds
  2015-02-02 19:10 ` Joe Perches
  0 siblings, 2 replies; 27+ messages in thread
From: Anshul Garg @ 2015-02-02 17:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: aksgarg1989, anshul.g, torvalds

From: Anshul Garg <aksgarg1989@gmail.com>

Unnecessary instructions are executing even though m is
greater than x so added logic to make m less than equal to
x before performing these operations.

Signed-off-by: Anshul Garg <aksgarg1989@gmail.com>
---
 lib/int_sqrt.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index 1ef4cc3..64ae722 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -22,6 +22,9 @@ unsigned long int_sqrt(unsigned long x)
 		return x;
 
 	m = 1UL << (BITS_PER_LONG - 2);
+
+	while (m > x)
+		m >>= 2;
 	while (m != 0) {
 		b = y + m;
 		y >>= 1;
-- 
1.7.9.5


---
This email has been checked for viruses by Avast antivirus software.
http://www.avast.com


^ permalink raw reply related	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2017-07-21 13:33 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-29 13:35 [PATCH] lib/int_sqrt.c: Optimize square root function Anshul Garg
2015-02-02 17:12 Anshul Garg
2015-02-02 19:00 ` Linus Torvalds
2015-02-02 19:13   ` Linus Torvalds
2015-02-03  4:41     ` Davidlohr Bueso
2015-02-03  4:57       ` Davidlohr Bueso
2015-02-03 15:54       ` Anshul Garg
2017-07-20 11:24     ` Peter Zijlstra
2017-07-20 11:52       ` Joe Perches
2017-07-20 14:13         ` Peter Zijlstra
2017-07-20 15:27       ` Peter Zijlstra
2017-07-20 18:31       ` Linus Torvalds
2017-07-20 22:34         ` Peter Zijlstra
2017-07-20 23:24           ` Linus Torvalds
2017-07-21 11:40             ` Peter Zijlstra
2017-07-21 12:15               ` Joe Perches
2017-07-21 13:26                 ` Peter Zijlstra
2017-07-21 13:33                   ` Peter Zijlstra
2015-02-03  4:47   ` Davidlohr Bueso
2015-02-03 15:42     ` Anshul Garg
2015-02-05 18:20       ` Linus Torvalds
2015-02-05 18:33         ` Linus Torvalds
2015-02-05 18:43           ` Anshul Garg
2015-02-05 19:37             ` Linus Torvalds
2015-02-08 15:39               ` Anshul Garg
2015-02-02 19:10 ` Joe Perches
2015-02-02 19:39   ` Linus Torvalds

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.