linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] lib/int_sqrt: Fix, optimize and document
@ 2017-07-24 15:16 Peter Zijlstra
  2017-07-24 15:16 ` [PATCH 1/3] lib/int_sqrt: Optimize small argument Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Peter Zijlstra @ 2017-07-24 15:16 UTC (permalink / raw)
  To: torvalds, akpm
  Cc: dave, aksgarg1989, tglx, mingo, will.deacon, joe, peterz, linux-kernel

Hi,

Here are a few patches that should improve things lib/int_sqrt. As stated
elsewhere; I'm looking at using int_sqrt() to calculate the stdev on a normal
distribution and am expecting the input values to be smallish.

In any case, these optimizations should work fine for large numbers too. And
if you have a find-last-set or count-leading-zeros instruction they rock ;-)

I can post the tool used to generate the numbers or do a patch to add it to
tools/testing/ if people care. The cold numbers are fairly sensitive to code
layout (GCC version, random changes etc..), so I expect the branch predictor of
my SKL is only partially confused or there's other things at play. However the
general trend in the numbers seems fairly stable.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/3] lib/int_sqrt: Optimize small argument
  2017-07-24 15:16 [PATCH 0/3] lib/int_sqrt: Fix, optimize and document Peter Zijlstra
@ 2017-07-24 15:16 ` Peter Zijlstra
  2017-07-24 15:16 ` [PATCH 2/3] lib/int_sqrt: Optimize initial value compute Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2017-07-24 15:16 UTC (permalink / raw)
  To: torvalds, akpm
  Cc: dave, aksgarg1989, tglx, mingo, will.deacon, joe, peterz, linux-kernel

[-- Attachment #1: peterz-int_sqrt-opt-1.patch --]
[-- Type: text/plain, Size: 1613 bytes --]

The current int_sqrt() computation is sub-optimal for the case of
small @x.

In this case, the compute loop:

	while (m != 0) {
		b = y + m;
		y >>= 1;

		if (x >= b) {
			x -= b;
			y += m;
		}
		m >>= 2;
	}

can be reduced to:

	while (m > x)
		m >>= 2;

Because y==0, b==m and until x>=m y will remain 0.

And while this is computationally equivalent, it runs much faster
because there's less code, in particular less branches.

      cycles:                 branches:              branch-misses:

OLD:

hot:   45.109444 +- 0.044117  44.333392 +- 0.002254  0.018723 +- 0.000593
cold: 187.737379 +- 0.156678  44.333407 +- 0.002254  6.272844 +- 0.004305

PRE:

hot:   67.937492 +- 0.064124  66.999535 +- 0.000488  0.066720 +- 0.001113
cold: 232.004379 +- 0.332811  66.999527 +- 0.000488  6.914634 +- 0.006568

POST:

hot:   43.633557 +- 0.034373  45.333132 +- 0.002277  0.023529 +- 0.000681
cold: 207.438411 +- 0.125840  45.333132 +- 0.002277  6.976486 +- 0.004219

Averages computed over all values <128k using a LFSR to generate
order. Cold numbers have a LFSR based branch trace buffer 'confuser'
ran between each int_sqrt() invocation.

Fixes: 30493cc9dddb ("lib/int_sqrt.c: optimize square root algorithm")
Suggested-by: Anshul Garg <aksgarg1989@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 lib/int_sqrt.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -22,6 +22,9 @@ unsigned long int_sqrt(unsigned long x)
 		return x;
 
 	m = 1UL << (BITS_PER_LONG - 2);
+	while (m > x)
+		m >>= 2;
+
 	while (m != 0) {
 		b = y + m;
 		y >>= 1;

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 2/3] lib/int_sqrt: Optimize initial value compute
  2017-07-24 15:16 [PATCH 0/3] lib/int_sqrt: Fix, optimize and document Peter Zijlstra
  2017-07-24 15:16 ` [PATCH 1/3] lib/int_sqrt: Optimize small argument Peter Zijlstra
@ 2017-07-24 15:16 ` Peter Zijlstra
  2017-07-24 17:35   ` Linus Torvalds
  2017-07-25 11:50   ` Will Deacon
  2017-07-24 15:16 ` [PATCH 3/3] lib/int_sqrt: Adjust comments Peter Zijlstra
  2017-07-24 15:42 ` [PATCH 0/3] lib/int_sqrt: Fix, optimize and document Joe Perches
  3 siblings, 2 replies; 10+ messages in thread
From: Peter Zijlstra @ 2017-07-24 15:16 UTC (permalink / raw)
  To: torvalds, akpm
  Cc: dave, aksgarg1989, tglx, mingo, will.deacon, joe, peterz, linux-kernel

[-- Attachment #1: peterz-int_sqrt-opt-2.patch --]
[-- Type: text/plain, Size: 1770 bytes --]

The initial value (@m) compute is:

	m = 1UL << (BITS_PER_LONG - 2);
	while (m > x)
		m >>= 2;

Which is a linear search for the highest even bit smaller or equal to @x
We can implement this using a binary search using __fls() (or better
when its hardware implemented).

	m = 1UL << (__fls(x) & ~1UL);

Especially for small values of @x; which are the more common
arguments; the linear search is near to worst case, while the binary
search of __fls() is a constant 6 branches.

      cycles:                 branches:              branch-misses:

PRE:

hot:   43.633557 +- 0.034373  45.333132 +- 0.002277  0.023529 +- 0.000681
cold: 207.438411 +- 0.125840  45.333132 +- 0.002277  6.976486 +- 0.004219

SOFTWARE FLS:

hot:   29.576176 +- 0.028850  26.666730 +- 0.004511  0.019463 +- 0.000663
cold: 165.947136 +- 0.188406  26.666746 +- 0.004511  6.133897 +- 0.004386

HARDWARE FLS:

hot:   24.720922 +- 0.025161  20.666784 +- 0.004509  0.020836 +- 0.000677
cold: 132.777197 +- 0.127471  20.666776 +- 0.004509  5.080285 +- 0.003874

Averages computed over all values <128k using a LFSR to generate
order. Cold numbers have a LFSR based branch trace buffer 'confuser'
ran between each int_sqrt() invocation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 lib/int_sqrt.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -7,6 +7,7 @@
 
 #include <linux/kernel.h>
 #include <linux/export.h>
+#include <linux/bitops.h>
 
 /**
  * int_sqrt - rough approximation to sqrt
@@ -21,10 +22,7 @@ unsigned long int_sqrt(unsigned long x)
 	if (x <= 1)
 		return x;
 
-	m = 1UL << (BITS_PER_LONG - 2);
-	while (m > x)
-		m >>= 2;
-
+	m = 1UL << (__fls(x) & ~1UL);
 	while (m != 0) {
 		b = y + m;
 		y >>= 1;

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 3/3] lib/int_sqrt: Adjust comments
  2017-07-24 15:16 [PATCH 0/3] lib/int_sqrt: Fix, optimize and document Peter Zijlstra
  2017-07-24 15:16 ` [PATCH 1/3] lib/int_sqrt: Optimize small argument Peter Zijlstra
  2017-07-24 15:16 ` [PATCH 2/3] lib/int_sqrt: Optimize initial value compute Peter Zijlstra
@ 2017-07-24 15:16 ` Peter Zijlstra
  2017-07-24 15:42 ` [PATCH 0/3] lib/int_sqrt: Fix, optimize and document Joe Perches
  3 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2017-07-24 15:16 UTC (permalink / raw)
  To: torvalds, akpm
  Cc: dave, aksgarg1989, tglx, mingo, will.deacon, joe, peterz, linux-kernel

[-- Attachment #1: peterz-int_sqrt-opt-3.patch --]
[-- Type: text/plain, Size: 638 bytes --]

Our current int_sqrt() is not rough nor any approximation; it
calculates the exact value of: floor(sqrt()). Document this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 lib/int_sqrt.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -10,10 +10,10 @@
 #include <linux/bitops.h>
 
 /**
- * int_sqrt - rough approximation to sqrt
+ * int_sqrt - computes the integer square root
  * @x: integer of which to calculate the sqrt
  *
- * A very rough approximation to the sqrt() function.
+ * Computes: floor(sqrt(x))
  */
 unsigned long int_sqrt(unsigned long x)
 {

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/3] lib/int_sqrt: Fix, optimize and document
  2017-07-24 15:16 [PATCH 0/3] lib/int_sqrt: Fix, optimize and document Peter Zijlstra
                   ` (2 preceding siblings ...)
  2017-07-24 15:16 ` [PATCH 3/3] lib/int_sqrt: Adjust comments Peter Zijlstra
@ 2017-07-24 15:42 ` Joe Perches
  3 siblings, 0 replies; 10+ messages in thread
From: Joe Perches @ 2017-07-24 15:42 UTC (permalink / raw)
  To: Peter Zijlstra, torvalds, akpm
  Cc: dave, aksgarg1989, tglx, mingo, will.deacon, linux-kernel

On Mon, 2017-07-24 at 17:16 +0200, Peter Zijlstra wrote:
> Here are a few patches that should improve things lib/int_sqrt.

Thanks Peter.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] lib/int_sqrt: Optimize initial value compute
  2017-07-24 15:16 ` [PATCH 2/3] lib/int_sqrt: Optimize initial value compute Peter Zijlstra
@ 2017-07-24 17:35   ` Linus Torvalds
  2017-07-25  8:17     ` Peter Zijlstra
  2017-07-25 11:50   ` Will Deacon
  1 sibling, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2017-07-24 17:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Davidlohr Bueso, Anshul Garg, Thomas Gleixner,
	Ingo Molnar, Will Deacon, Joe Perches, Linux Kernel Mailing List

Ack. You have numbers, it's all good.

Except I'd still want you to comment on why you cared and about which
piece of your upcoming code this is going to matter for, ok?

              Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] lib/int_sqrt: Optimize initial value compute
  2017-07-24 17:35   ` Linus Torvalds
@ 2017-07-25  8:17     ` Peter Zijlstra
  2017-07-25 15:43       ` Linus Torvalds
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2017-07-25  8:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Davidlohr Bueso, Anshul Garg, Thomas Gleixner,
	Ingo Molnar, Will Deacon, Joe Perches, Linux Kernel Mailing List

On Mon, Jul 24, 2017 at 10:35:56AM -0700, Linus Torvalds wrote:
> Ack. You have numbers, it's all good.

Thanks!

> Except I'd still want you to comment on why you cared and about which
> piece of your upcoming code this is going to matter for, ok?

I did an RFC here:

  https://lkml.kernel.org/r/20170719133940.uytsixvfgpmo3ane@hirez.programming.kicks-ass.net

And that is the patch that, through Thomas asking me about our sqrt(),
kick started these here patches.

There are a few more sites that would need similar treatment, but I've
not gone through the entire idle predictor yet.


Basically the observation is that, for performance, we seem to pick too
deep an idle state. The result is that the exit latency from this state
is higher than we'd like and performance hurts because of that.

The thinking is that if you estimate the average idle duration, you'll
be too long 50% of the time -- that is after all a fundamental part of
being the average. My proposed solution in that patch is computing the
value for which we're too long less than n%, in statistic speak:

  P(X < x)

Which is given by the CDF(x). In any case, assuming a normal
distribution, you end up with something like:

  avg - Z * stdev

Where Z depends on our cut-off and is basically a table lookup.  The
whole sqrt() comes from having to compute the stdev, as that is the
square root of the variance.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] lib/int_sqrt: Optimize initial value compute
  2017-07-24 15:16 ` [PATCH 2/3] lib/int_sqrt: Optimize initial value compute Peter Zijlstra
  2017-07-24 17:35   ` Linus Torvalds
@ 2017-07-25 11:50   ` Will Deacon
  1 sibling, 0 replies; 10+ messages in thread
From: Will Deacon @ 2017-07-25 11:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, akpm, dave, aksgarg1989, tglx, mingo, joe, linux-kernel

Hi Peter,

On Mon, Jul 24, 2017 at 05:16:32PM +0200, Peter Zijlstra wrote:
> The initial value (@m) compute is:
> 
> 	m = 1UL << (BITS_PER_LONG - 2);
> 	while (m > x)
> 		m >>= 2;
> 
> Which is a linear search for the highest even bit smaller or equal to @x
> We can implement this using a binary search using __fls() (or better
> when its hardware implemented).
> 
> 	m = 1UL << (__fls(x) & ~1UL);
> 
> Especially for small values of @x; which are the more common
> arguments; the linear search is near to worst case, while the binary
> search of __fls() is a constant 6 branches.
> 
>       cycles:                 branches:              branch-misses:
> 
> PRE:
> 
> hot:   43.633557 +- 0.034373  45.333132 +- 0.002277  0.023529 +- 0.000681
> cold: 207.438411 +- 0.125840  45.333132 +- 0.002277  6.976486 +- 0.004219
> 
> SOFTWARE FLS:
> 
> hot:   29.576176 +- 0.028850  26.666730 +- 0.004511  0.019463 +- 0.000663
> cold: 165.947136 +- 0.188406  26.666746 +- 0.004511  6.133897 +- 0.004386
> 
> HARDWARE FLS:
> 
> hot:   24.720922 +- 0.025161  20.666784 +- 0.004509  0.020836 +- 0.000677
> cold: 132.777197 +- 0.127471  20.666776 +- 0.004509  5.080285 +- 0.003874
> 
> Averages computed over all values <128k using a LFSR to generate
> order. Cold numbers have a LFSR based branch trace buffer 'confuser'
> ran between each int_sqrt() invocation.

The hardware fls version works nicely for arm64, where it can be implemented
using the clz instruction (via the __builtin_clzl intrinsic).

Acked-by: Will Deacon <will.deacon@arm.com>

Cheers,

Will

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] lib/int_sqrt: Optimize initial value compute
  2017-07-25  8:17     ` Peter Zijlstra
@ 2017-07-25 15:43       ` Linus Torvalds
  2017-07-25 16:08         ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2017-07-25 15:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Davidlohr Bueso, Anshul Garg, Thomas Gleixner,
	Ingo Molnar, Will Deacon, Joe Perches, Linux Kernel Mailing List

On Tue, Jul 25, 2017 at 1:17 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jul 24, 2017 at 10:35:56AM -0700, Linus Torvalds wrote:
>> Ack. You have numbers, it's all good.
>
> Thanks!
>
>> Except I'd still want you to comment on why you cared and about which
>> piece of your upcoming code this is going to matter for, ok?
>
> I did an RFC here:
>
>   https://lkml.kernel.org/r/20170719133940.uytsixvfgpmo3ane@hirez.programming.kicks-ass.net

Christ Peter.

I meant add the explanation to the commit log, to explain *why* you
care about this function that nobody has ever cared about before.

Please. Document your patches. Don't make people ask for explanations.

I've asked now *three* times about this, and still the patch
submission had no explanation at all about why this mattered.

How many times do people have to ask before you start saying "oh,
maybe I should explain why.." in your patch.

                   Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] lib/int_sqrt: Optimize initial value compute
  2017-07-25 15:43       ` Linus Torvalds
@ 2017-07-25 16:08         ` Peter Zijlstra
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2017-07-25 16:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Davidlohr Bueso, Anshul Garg, Thomas Gleixner,
	Ingo Molnar, Will Deacon, Joe Perches, Linux Kernel Mailing List

On Tue, Jul 25, 2017 at 08:43:39AM -0700, Linus Torvalds wrote:

> Please. Document your patches. Don't make people ask for explanations.

Yes, I _am_ a tad dense these days.. Let me go do that.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-07-25 16:08 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-24 15:16 [PATCH 0/3] lib/int_sqrt: Fix, optimize and document Peter Zijlstra
2017-07-24 15:16 ` [PATCH 1/3] lib/int_sqrt: Optimize small argument Peter Zijlstra
2017-07-24 15:16 ` [PATCH 2/3] lib/int_sqrt: Optimize initial value compute Peter Zijlstra
2017-07-24 17:35   ` Linus Torvalds
2017-07-25  8:17     ` Peter Zijlstra
2017-07-25 15:43       ` Linus Torvalds
2017-07-25 16:08         ` Peter Zijlstra
2017-07-25 11:50   ` Will Deacon
2017-07-24 15:16 ` [PATCH 3/3] lib/int_sqrt: Adjust comments Peter Zijlstra
2017-07-24 15:42 ` [PATCH 0/3] lib/int_sqrt: Fix, optimize and document Joe Perches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).