All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Optimize int_sqrt for small values for faster idle
@ 2016-01-28 21:42 Andi Kleen
  2016-01-28 22:04 ` kbuild test robot
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Andi Kleen @ 2016-01-28 21:42 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, davidlohr.bueso, rafael.j.wysocki, lenb, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

The menu cpuidle governor does at least two int_sqrt() each time
we go into idle in get_typical_interval to compute stddev

int_sqrts take 100-120 cycles each. Short idle latency is important
for many workloads.

I instrumented the function on my workstation and most values are
16bit only and most others 32bit (50% percentile is 122094,
75% is 3699533).

sqrt is implemented by starting with an initial estimation,
and then iterating. int_sqrt currently only uses a fixed
estimating which is good for 64bits worth of input.

This patch adds some checks at the beginning to start with
a better estimate for values fitting in 8, 16bit and 32bit.
This makes int_sqrt between 60+% faster for values in 16bit,
and still somewhat faster (between 10 and 30%) for larger values
upto 32bit. Full 64bit is slightly slower.

This optimizes the short idle calls and does not hurt the
long sleep (which probably do not care) much.

An alternative would be a full table drive approach, or
trying some inverted sqrt optimization, but this simple change
already seems to have a good payoff.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 lib/int_sqrt.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index 1ef4cc3..2479ccf 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -21,7 +21,15 @@ unsigned long int_sqrt(unsigned long x)
 	if (x <= 1)
 		return x;
 
-	m = 1UL << (BITS_PER_LONG - 2);
+	if (x <= 0xffff) {
+		if (m <= 0xff)
+			m = 1UL << (8 - 2);
+		else
+			m = 1UL << (16 - 2);
+	} else if (x <= 0xffffffff)
+		m = 1UL << (32 - 2);
+	else
+		m = 1UL << (BITS_PER_LONG - 2);
 	while (m != 0) {
 		b = y + m;
 		y >>= 1;
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-01-28 21:42 [PATCH] Optimize int_sqrt for small values for faster idle Andi Kleen
@ 2016-01-28 22:04 ` kbuild test robot
  2016-01-28 22:11 ` kbuild test robot
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: kbuild test robot @ 2016-01-28 22:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: kbuild-all, akpm, linux-kernel, davidlohr.bueso,
	rafael.j.wysocki, lenb, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 1671 bytes --]

Hi Andi,

[auto build test WARNING on v4.5-rc1]
[also build test WARNING on next-20160128]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Andi-Kleen/Optimize-int_sqrt-for-small-values-for-faster-idle/20160129-054629
config: x86_64-randconfig-x015-01270835 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   lib/int_sqrt.c: In function 'int_sqrt':
>> lib/int_sqrt.c:25:6: warning: 'm' may be used uninitialized in this function [-Wmaybe-uninitialized]
      if (m <= 0xff)
         ^

vim +/m +25 lib/int_sqrt.c

     9	#include <linux/export.h>
    10	
    11	/**
    12	 * int_sqrt - rough approximation to sqrt
    13	 * @x: integer of which to calculate the sqrt
    14	 *
    15	 * A very rough approximation to the sqrt() function.
    16	 */
    17	unsigned long int_sqrt(unsigned long x)
    18	{
    19		unsigned long b, m, y = 0;
    20	
    21		if (x <= 1)
    22			return x;
    23	
    24		if (x <= 0xffff) {
  > 25			if (m <= 0xff)
    26				m = 1UL << (8 - 2);
    27			else
    28				m = 1UL << (16 - 2);
    29		} else if (x <= 0xffffffff)
    30			m = 1UL << (32 - 2);
    31		else
    32			m = 1UL << (BITS_PER_LONG - 2);
    33		while (m != 0) {

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 19426 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-01-28 21:42 [PATCH] Optimize int_sqrt for small values for faster idle Andi Kleen
  2016-01-28 22:04 ` kbuild test robot
@ 2016-01-28 22:11 ` kbuild test robot
  2016-01-28 22:15 ` Joe Perches
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: kbuild test robot @ 2016-01-28 22:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: kbuild-all, akpm, linux-kernel, davidlohr.bueso,
	rafael.j.wysocki, lenb, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 1627 bytes --]

Hi Andi,

[auto build test WARNING on v4.5-rc1]
[also build test WARNING on next-20160128]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Andi-Kleen/Optimize-int_sqrt-for-small-values-for-faster-idle/20160129-054629
config: avr32-atngw100_defconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=avr32 

All warnings (new ones prefixed by >>):

   lib/int_sqrt.c: In function 'int_sqrt':
>> lib/int_sqrt.c:25: warning: 'm' is used uninitialized in this function

vim +/m +25 lib/int_sqrt.c

     9	#include <linux/export.h>
    10	
    11	/**
    12	 * int_sqrt - rough approximation to sqrt
    13	 * @x: integer of which to calculate the sqrt
    14	 *
    15	 * A very rough approximation to the sqrt() function.
    16	 */
    17	unsigned long int_sqrt(unsigned long x)
    18	{
    19		unsigned long b, m, y = 0;
    20	
    21		if (x <= 1)
    22			return x;
    23	
    24		if (x <= 0xffff) {
  > 25			if (m <= 0xff)
    26				m = 1UL << (8 - 2);
    27			else
    28				m = 1UL << (16 - 2);
    29		} else if (x <= 0xffffffff)
    30			m = 1UL << (32 - 2);
    31		else
    32			m = 1UL << (BITS_PER_LONG - 2);
    33		while (m != 0) {

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 12120 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-01-28 21:42 [PATCH] Optimize int_sqrt for small values for faster idle Andi Kleen
  2016-01-28 22:04 ` kbuild test robot
  2016-01-28 22:11 ` kbuild test robot
@ 2016-01-28 22:15 ` Joe Perches
  2016-01-28 22:40   ` Andi Kleen
  2016-01-28 22:22 ` Joe Perches
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Joe Perches @ 2016-01-28 22:15 UTC (permalink / raw)
  To: Andi Kleen, akpm
  Cc: linux-kernel, davidlohr.bueso, rafael.j.wysocki, lenb,
	Andi Kleen, anshul.g

(adding Anshul Garg)

On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
> 
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
> 
> I instrumented the function on my workstation and most values are
> 16bit only and most others 32bit (50% percentile is 122094,
> 75% is 3699533).
> 
> sqrt is implemented by starting with an initial estimation,
> and then iterating. int_sqrt currently only uses a fixed
> estimating which is good for 64bits worth of input.
> 
> This patch adds some checks at the beginning to start with
> a better estimate for values fitting in 8, 16bit and 32bit.
> This makes int_sqrt between 60+% faster for values in 16bit,
> and still somewhat faster (between 10 and 30%) for larger values
> upto 32bit. Full 64bit is slightly slower.
> 
> This optimizes the short idle calls and does not hurt the
> long sleep (which probably do not care) much.
> 
> An alternative would be a full table drive approach, or
> trying some inverted sqrt optimization, but this simple change
> already seems to have a good payoff.

This thread might be relevant:

https://lkml.org/lkml/2015/2/2/600

and perhaps using fls might still be a good approach.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-01-28 21:42 [PATCH] Optimize int_sqrt for small values for faster idle Andi Kleen
                   ` (2 preceding siblings ...)
  2016-01-28 22:15 ` Joe Perches
@ 2016-01-28 22:22 ` Joe Perches
  2016-01-28 22:29 ` Eric Dumazet
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Joe Perches @ 2016-01-28 22:22 UTC (permalink / raw)
  To: Andi Kleen, akpm, Anshul Garg
  Cc: linux-kernel, rafael.j.wysocki, lenb, Andi Kleen, Davidlohr Bueso

(resending with email addresses that shouldn't bounce)
(adding Anshul Garg)
(fixed Davidlohr Bueso's address)

On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
> 
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
> 
> I instrumented the function on my workstation and most values are
> 16bit only and most others 32bit (50% percentile is 122094,
> 75% is 3699533).
> 
> sqrt is implemented by starting with an initial estimation,
> and then iterating. int_sqrt currently only uses a fixed
> estimating which is good for 64bits worth of input.
> 
> This patch adds some checks at the beginning to start with
> a better estimate for values fitting in 8, 16bit and 32bit.
> This makes int_sqrt between 60+% faster for values in 16bit,
> and still somewhat faster (between 10 and 30%) for larger values
> upto 32bit. Full 64bit is slightly slower.
> 
> This optimizes the short idle calls and does not hurt the
> long sleep (which probably do not care) much.
> 
> An alternative would be a full table drive approach, or
> trying some inverted sqrt optimization, but this simple change
> already seems to have a good payoff.

This thread might be relevant:

https://lkml.org/lkml/2015/2/2/600

and perhaps using fls might still be a good approach.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-01-28 21:42 [PATCH] Optimize int_sqrt for small values for faster idle Andi Kleen
                   ` (3 preceding siblings ...)
  2016-01-28 22:22 ` Joe Perches
@ 2016-01-28 22:29 ` Eric Dumazet
  2016-01-28 22:30 ` Andi Kleen
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2016-01-28 22:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: akpm, linux-kernel, davidlohr.bueso, rafael.j.wysocki, lenb, Andi Kleen

On Thu, 2016-01-28 at 13:42 -0800, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
> 
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
> 
> I instrumented the function on my workstation and most values are
> 16bit only and most others 32bit (50% percentile is 122094,
> 75% is 3699533).
> 
> sqrt is implemented by starting with an initial estimation,
> and then iterating. int_sqrt currently only uses a fixed
> estimating which is good for 64bits worth of input.
> 
> This patch adds some checks at the beginning to start with
> a better estimate for values fitting in 8, 16bit and 32bit.
> This makes int_sqrt between 60+% faster for values in 16bit,
> and still somewhat faster (between 10 and 30%) for larger values
> upto 32bit. Full 64bit is slightly slower.
> 
> This optimizes the short idle calls and does not hurt the
> long sleep (which probably do not care) much.
> 
> An alternative would be a full table drive approach, or
> trying some inverted sqrt optimization, but this simple change
> already seems to have a good payoff.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  lib/int_sqrt.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
> index 1ef4cc3..2479ccf 100644
> --- a/lib/int_sqrt.c
> +++ b/lib/int_sqrt.c
> @@ -21,7 +21,15 @@ unsigned long int_sqrt(unsigned long x)
>  	if (x <= 1)
>  		return x;

The above test (x <= 1) should also be moved

>  
> -	m = 1UL << (BITS_PER_LONG - 2);
> +	if (x <= 0xffff) {
> +		if (m <= 0xff)

m or x ?  if (x <= 0xff) looks more correct.

> +			m = 1UL << (8 - 2);
> +		else
> +			m = 1UL << (16 - 2);
> +	} else if (x <= 0xffffffff)
> +		m = 1UL << (32 - 2);
> +	else
> +		m = 1UL << (BITS_PER_LONG - 2);
>  	while (m != 0) {
>  		b = y + m;
>  		y >>= 1;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-01-28 21:42 [PATCH] Optimize int_sqrt for small values for faster idle Andi Kleen
                   ` (4 preceding siblings ...)
  2016-01-28 22:29 ` Eric Dumazet
@ 2016-01-28 22:30 ` Andi Kleen
  2016-01-29  3:59 ` Rafael J. Wysocki
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2016-01-28 22:30 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, davidlohr.bueso, rafael.j.wysocki, lenb

Andi Kleen <andi@firstfloor.org> writes:

> From: Andi Kleen <ak@linux.intel.com>
>
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev

Added a stupid typo in the last minute. I'll post a new version.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-01-28 22:15 ` Joe Perches
@ 2016-01-28 22:40   ` Andi Kleen
  0 siblings, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2016-01-28 22:40 UTC (permalink / raw)
  To: Joe Perches
  Cc: Andi Kleen, akpm, linux-kernel, davidlohr.bueso,
	rafael.j.wysocki, lenb, Andi Kleen, anshul.g

> This thread might be relevant:
> 
> https://lkml.org/lkml/2015/2/2/600
> 
> and perhaps using fls might still be a good approach.

Linus wrote:


>>>
We *probably* have some argument range that we care more about, which
is why I'd like to know what the profile is that triggered this
optimization, and what the common argument range is.
<<<

That's exactly what I did. Used perf probe to get the common
range and optimize for that.

-Andi

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-01-28 21:42 [PATCH] Optimize int_sqrt for small values for faster idle Andi Kleen
                   ` (5 preceding siblings ...)
  2016-01-28 22:30 ` Andi Kleen
@ 2016-01-29  3:59 ` Rafael J. Wysocki
  2016-01-31  7:27 ` Thomas Rohwer
  2016-02-01 21:25 ` Rasmus Villemoes
  8 siblings, 0 replies; 22+ messages in thread
From: Rafael J. Wysocki @ 2016-01-29  3:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: akpm, linux-kernel, davidlohr.bueso, rafael.j.wysocki, lenb, Andi Kleen

On Thursday, January 28, 2016 01:42:45 PM Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
> 
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
> 
> I instrumented the function on my workstation and most values are
> 16bit only and most others 32bit (50% percentile is 122094,
> 75% is 3699533).
> 
> sqrt is implemented by starting with an initial estimation,
> and then iterating. int_sqrt currently only uses a fixed
> estimating which is good for 64bits worth of input.
> 
> This patch adds some checks at the beginning to start with
> a better estimate for values fitting in 8, 16bit and 32bit.
> This makes int_sqrt between 60+% faster for values in 16bit,
> and still somewhat faster (between 10 and 30%) for larger values
> upto 32bit. Full 64bit is slightly slower.
> 
> This optimizes the short idle calls and does not hurt the
> long sleep (which probably do not care) much.
> 
> An alternative would be a full table drive approach, or
> trying some inverted sqrt optimization, but this simple change
> already seems to have a good payoff.

I'm wondering if you have any numbers on how much of a difference this
makes in practice in terms of energy consumption, performance, latency etc.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-01-28 21:42 [PATCH] Optimize int_sqrt for small values for faster idle Andi Kleen
                   ` (6 preceding siblings ...)
  2016-01-29  3:59 ` Rafael J. Wysocki
@ 2016-01-31  7:27 ` Thomas Rohwer
  2016-02-01 21:25 ` Rasmus Villemoes
  8 siblings, 0 replies; 22+ messages in thread
From: Thomas Rohwer @ 2016-01-31  7:27 UTC (permalink / raw)
  To: Andi Kleen, akpm
  Cc: linux-kernel, davidlohr.bueso, rafael.j.wysocki, lenb, Andi Kleen

Hello,

 > -	m = 1UL << (BITS_PER_LONG - 2);
 > +	if (x <= 0xffff) {
 > +		if (m <= 0xff)
 > +			m = 1UL << (8 - 2);
 > +		else
 > +			m = 1UL << (16 - 2);
 > +	} else if (x <= 0xffffffff)
 > +		m = 1UL << (32 - 2);
 > +	else
 > +		m = 1UL << (BITS_PER_LONG - 2);
 >   	while (m != 0) {
 >   		b = y + m;
 >   		y >>= 1;
 >


I think, m can be initialized with

1 << (greatest multiple of 2 less than or equal to (position of most significant bit of x))

i.e. 1 << ((position of most significant bit of x) & 62)

without changing the outcome of the original algorithm (as long as x<m the loop does just m >>= 2).

I believe, that for (position of most significant bit of x) there is an efficient macro, and
some processors directly have an instruction for it. So this would probably be faster than your suggestion
for an initial starting value and give an even better starting value (cutting in some cases further on the number of
while loop interations).

If one just wants to achieve a result with a certain relative error in terms of the fraction of the input, one can
probably only look at the most significant bit and a few following bits of x.

Sincerely,

Thomas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-01-28 21:42 [PATCH] Optimize int_sqrt for small values for faster idle Andi Kleen
                   ` (7 preceding siblings ...)
  2016-01-31  7:27 ` Thomas Rohwer
@ 2016-02-01 21:25 ` Rasmus Villemoes
  2016-02-01 21:36   ` Andi Kleen
  8 siblings, 1 reply; 22+ messages in thread
From: Rasmus Villemoes @ 2016-02-01 21:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: akpm, linux-kernel, davidlohr.bueso, rafael.j.wysocki, lenb, Andi Kleen

On Thu, Jan 28 2016, Andi Kleen <andi@firstfloor.org> wrote:

> From: Andi Kleen <ak@linux.intel.com>
>
> The menu cpuidle governor does at least two int_sqrt() each time
> we go into idle in get_typical_interval to compute stddev
>
> int_sqrts take 100-120 cycles each. Short idle latency is important
> for many workloads.
>

If you want to optimize get_typical_interval(), why not just take the
square root out of the equation (literally)?

Something like

From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Date: Mon, 1 Feb 2016 21:43:18 +0100
Subject: [PATCH] cpuidle: menu: avoid expensive square root computation

Computing the integer square root is a rather expensive operation, at
least compared to doing a 64x64 -> 64 multiply (avg*avg) and, on 64
bit platforms, doing an extra comparison to a constant (variance <=
U64_MAX/36).

On 64 bit platforms, this does mean that we add a restriction on the
range of the variance where we end up using the estimate (since
previously the stddev <= ULONG_MAX was a tautology), but on the other
hand, we extend the range quite substantially on 32 bit platforms - in
both cases, we now allow standard deviations up to 715 seconds, which
is for example guaranteed if all observations are less than 1430
seconds.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
 drivers/cpuidle/governors/menu.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 0742b3296673..beef7ae123ba 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -200,7 +200,7 @@ static void get_typical_interval(struct menu_device *data)
 {
 	int i, divisor;
 	unsigned int max, thresh;
-	uint64_t avg, stddev;
+	uint64_t avg, variance;
 
 	thresh = UINT_MAX; /* Discard outliers above this value */
 
@@ -224,36 +224,35 @@ again:
 	else
 		do_div(avg, divisor);
 
-	/* Then try to determine standard deviation */
-	stddev = 0;
+	/* Then try to determine variance */
+	variance = 0;
 	for (i = 0; i < INTERVALS; i++) {
 		unsigned int value = data->intervals[i];
 		if (value <= thresh) {
 			int64_t diff = value - avg;
-			stddev += diff * diff;
+			variance += diff * diff;
 		}
 	}
 	if (divisor == INTERVALS)
-		stddev >>= INTERVAL_SHIFT;
+		variance >>= INTERVAL_SHIFT;
 	else
-		do_div(stddev, divisor);
+		do_div(variance, divisor);
 
 	/*
-	 * The typical interval is obtained when standard deviation is small
-	 * or standard deviation is small compared to the average interval.
-	 *
-	 * int_sqrt() formal parameter type is unsigned long. When the
-	 * greatest difference to an outlier exceeds ~65 ms * sqrt(divisor)
-	 * the resulting squared standard deviation exceeds the input domain
-	 * of int_sqrt on platforms where unsigned long is 32 bits in size.
-	 * In such case reject the candidate average.
+	 * The typical interval is obtained when standard deviation is
+	 * small (stddev <= 20 us, variance <= 400 us^2) or standard
+	 * deviation is small compared to the average interval (avg >
+	 * 6*stddev, avg^2 > 36*variance). The average is smaller than
+	 * UINT_MAX aka U32_MAX, so computing its square does not
+	 * overflow a u64. We simply reject this candidate average if
+	 * the standard deviation is greater than 715 s (which is
+	 * rather unlikely).
 	 *
 	 * Use this result only if there is no timer to wake us up sooner.
 	 */
-	if (likely(stddev <= ULONG_MAX)) {
-		stddev = int_sqrt(stddev);
-		if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
-							|| stddev <= 20) {
+	if (likely(variance <= U64_MAX/36)) {
+		if (((avg*avg > variance*36) && (divisor * 4 >= INTERVALS * 3))
+							|| variance <= 400) {
 			if (data->next_timer_us > avg)
 				data->predicted_us = avg;
 			return;
-- 
2.6.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-02-01 21:25 ` Rasmus Villemoes
@ 2016-02-01 21:36   ` Andi Kleen
  2016-02-01 23:08     ` Rasmus Villemoes
  2016-02-07 21:32     ` Rasmus Villemoes
  0 siblings, 2 replies; 22+ messages in thread
From: Andi Kleen @ 2016-02-01 21:36 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andi Kleen, akpm, linux-kernel, davidlohr.bueso, rafael.j.wysocki, lenb

On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
> On Thu, Jan 28 2016, Andi Kleen <andi@firstfloor.org> wrote:
> 
> > From: Andi Kleen <ak@linux.intel.com>
> >
> > The menu cpuidle governor does at least two int_sqrt() each time
> > we go into idle in get_typical_interval to compute stddev
> >
> > int_sqrts take 100-120 cycles each. Short idle latency is important
> > for many workloads.
> >
> 
> If you want to optimize get_typical_interval(), why not just take the
> square root out of the equation (literally)?
> 
> Something like

Looks good. Yes that's a better fix.

-Andi

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-02-01 21:36   ` Andi Kleen
@ 2016-02-01 23:08     ` Rasmus Villemoes
  2016-02-02  0:00       ` Andi Kleen
  2016-02-02  0:36       ` Eric Dumazet
  2016-02-07 21:32     ` Rasmus Villemoes
  1 sibling, 2 replies; 22+ messages in thread
From: Rasmus Villemoes @ 2016-02-01 23:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, akpm, linux-kernel, davidlohr.bueso, rafael.j.wysocki, lenb

On Mon, Feb 01 2016, Andi Kleen <ak@linux.intel.com> wrote:

> On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
>> On Thu, Jan 28 2016, Andi Kleen <andi@firstfloor.org> wrote:
>> 
>> > From: Andi Kleen <ak@linux.intel.com>
>> >
>> > The menu cpuidle governor does at least two int_sqrt() each time
>> > we go into idle in get_typical_interval to compute stddev
>> >
>> > int_sqrts take 100-120 cycles each. Short idle latency is important
>> > for many workloads.
>> >
>> 
>> If you want to optimize get_typical_interval(), why not just take the
>> square root out of the equation (literally)?
>> 
>> Something like
>
> Looks good. Yes that's a better fix.
>

Thanks. (Is there a good way to tell gcc that avg*avg is actually a
32x32->64 multiplication?)

While there and doing the math, I noticed that the variance computation
may _theoretically_ overflow (if half the observations are 0, half C,
the variance before the division should be around INTERVALS*C^2/4, which
is around 2^65 for C=UINT_MAX and INTERVALS=8). I have no idea if it
actually matters, but it can be fixed by lowering the initial threshold
from UINT_MAX to sqrt(4*U64_MAX/INTERVALS) ~~ 3e9. However, this would
make it possible that all observations are larger than the initial
threshold, so we'd have to protect against a division by zero...

Rasmus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-02-01 23:08     ` Rasmus Villemoes
@ 2016-02-02  0:00       ` Andi Kleen
  2016-02-02  0:36       ` Eric Dumazet
  1 sibling, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2016-02-02  0:00 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andi Kleen, akpm, linux-kernel, davidlohr.bueso, rafael.j.wysocki, lenb

On Tue, Feb 02, 2016 at 12:08:46AM +0100, Rasmus Villemoes wrote:
> On Mon, Feb 01 2016, Andi Kleen <ak@linux.intel.com> wrote:
> 
> > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
> >> On Thu, Jan 28 2016, Andi Kleen <andi@firstfloor.org> wrote:
> >> 
> >> > From: Andi Kleen <ak@linux.intel.com>
> >> >
> >> > The menu cpuidle governor does at least two int_sqrt() each time
> >> > we go into idle in get_typical_interval to compute stddev
> >> >
> >> > int_sqrts take 100-120 cycles each. Short idle latency is important
> >> > for many workloads.
> >> >
> >> 
> >> If you want to optimize get_typical_interval(), why not just take the
> >> square root out of the equation (literally)?
> >> 
> >> Something like
> >
> > Looks good. Yes that's a better fix.
> >
> 
> Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> 32x32->64 multiplication?)

I don't think there is, but you could define a custom macro with a fallback
on pure 64x64->64.

-Andi

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-02-01 23:08     ` Rasmus Villemoes
  2016-02-02  0:00       ` Andi Kleen
@ 2016-02-02  0:36       ` Eric Dumazet
  2016-02-02 20:46         ` Rasmus Villemoes
  2017-07-20 10:10         ` Peter Zijlstra
  1 sibling, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2016-02-02  0:36 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andi Kleen, Andi Kleen, akpm, linux-kernel, davidlohr.bueso,
	rafael.j.wysocki, lenb

On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:

> Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> 32x32->64 multiplication?)

If avg is 32bit, compiler does that for you.

u32 avg = ...

u64 result = (u64)avg * avg;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-02-02  0:36       ` Eric Dumazet
@ 2016-02-02 20:46         ` Rasmus Villemoes
  2016-02-02 21:30           ` Eric Dumazet
  2017-07-20 10:10         ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Rasmus Villemoes @ 2016-02-02 20:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, Andi Kleen, akpm, linux-kernel, davidlohr.bueso,
	rafael.j.wysocki, lenb

On Tue, Feb 02 2016, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:
>
>> Thanks. (Is there a good way to tell gcc that avg*avg is actually a
>> 32x32->64 multiplication?)
>
> If avg is 32bit, compiler does that for you.
>
> u32 avg = ...
>
> u64 result = (u64)avg * avg;

Yeah, but in this case avg is u64 because it is used to temporarily
contain the sum of a bunch of u32s, before being divided by #bunch. So
I'd have to write that as (u64)(u32)avg * (u32)avg, which isn't very
readable :-/

I just thought the scenario of a u64 known to be holding a value < 2^32
was common enough that some utility macros already existed.

Rasmus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-02-02 20:46         ` Rasmus Villemoes
@ 2016-02-02 21:30           ` Eric Dumazet
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2016-02-02 21:30 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andi Kleen, Andi Kleen, akpm, linux-kernel, davidlohr.bueso,
	rafael.j.wysocki, lenb

On Tue, 2016-02-02 at 21:46 +0100, Rasmus Villemoes wrote:
> On Tue, Feb 02 2016, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:
> >
> >> Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> >> 32x32->64 multiplication?)
> >
> > If avg is 32bit, compiler does that for you.
> >
> > u32 avg = ...
> >
> > u64 result = (u64)avg * avg;
> 
> Yeah, but in this case avg is u64 because it is used to temporarily
> contain the sum of a bunch of u32s, before being divided by #bunch. So
> I'd have to write that as (u64)(u32)avg * (u32)avg, which isn't very
> readable :-/
> 
> I just thought the scenario of a u64 known to be holding a value < 2^32
> was common enough that some utility macros already existed.
> 
> Rasmus

crypto/vmac.c has this, you could make it generic maybe.

#define MUL32(i1, i2)   ((u64)(u32)(i1)*(u32)(i2))

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-02-01 21:36   ` Andi Kleen
  2016-02-01 23:08     ` Rasmus Villemoes
@ 2016-02-07 21:32     ` Rasmus Villemoes
  2016-02-09 20:44       ` Andi Kleen
  1 sibling, 1 reply; 22+ messages in thread
From: Rasmus Villemoes @ 2016-02-07 21:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, akpm, linux-kernel, rafael.j.wysocki, lenb

On Mon, Feb 01 2016, Andi Kleen <ak@linux.intel.com> wrote:

> On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
>> On Thu, Jan 28 2016, Andi Kleen <andi@firstfloor.org> wrote:
>> 
>> > From: Andi Kleen <ak@linux.intel.com>
>> >
>> > The menu cpuidle governor does at least two int_sqrt() each time
>> > we go into idle in get_typical_interval to compute stddev
>> >
>> > int_sqrts take 100-120 cycles each. Short idle latency is important
>> > for many workloads.
>> >
>> 
>> If you want to optimize get_typical_interval(), why not just take the
>> square root out of the equation (literally)?
>> 
>> Something like
>
> Looks good. Yes that's a better fix.
>

Andi, did you have a way to measure the impact, and if so, could I get
you to run the numbers again with my patch?

Thanks,
Rasmus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-02-07 21:32     ` Rasmus Villemoes
@ 2016-02-09 20:44       ` Andi Kleen
  2016-02-10 13:31         ` Fengguang Wu
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2016-02-09 20:44 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andi Kleen, akpm, linux-kernel, rafael.j.wysocki, lenb, fengguang.wu

On Sun, Feb 07, 2016 at 10:32:26PM +0100, Rasmus Villemoes wrote:
> On Mon, Feb 01 2016, Andi Kleen <ak@linux.intel.com> wrote:
> 
> > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
> >> On Thu, Jan 28 2016, Andi Kleen <andi@firstfloor.org> wrote:
> >> 
> >> > From: Andi Kleen <ak@linux.intel.com>
> >> >
> >> > The menu cpuidle governor does at least two int_sqrt() each time
> >> > we go into idle in get_typical_interval to compute stddev
> >> >
> >> > int_sqrts take 100-120 cycles each. Short idle latency is important
> >> > for many workloads.
> >> >
> >> 
> >> If you want to optimize get_typical_interval(), why not just take the
> >> square root out of the equation (literally)?
> >> 
> >> Something like
> >
> > Looks good. Yes that's a better fix.
> >
> 
> Andi, did you have a way to measure the impact, and if so, could I get
> you to run the numbers again with my patch?

I got the numbers from the 0day runs (AIM7 gets faster)
In theory if you post the patch that should happen automatically
(checking with Fengguang)

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-02-09 20:44       ` Andi Kleen
@ 2016-02-10 13:31         ` Fengguang Wu
  0 siblings, 0 replies; 22+ messages in thread
From: Fengguang Wu @ 2016-02-10 13:31 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Rasmus Villemoes, Andi Kleen, akpm, linux-kernel, rafael.j.wysocki, lenb

On Tue, Feb 09, 2016 at 12:44:00PM -0800, Andi Kleen wrote:
> On Sun, Feb 07, 2016 at 10:32:26PM +0100, Rasmus Villemoes wrote:
> > On Mon, Feb 01 2016, Andi Kleen <ak@linux.intel.com> wrote:
> > 
> > > On Mon, Feb 01, 2016 at 10:25:17PM +0100, Rasmus Villemoes wrote:
> > >> On Thu, Jan 28 2016, Andi Kleen <andi@firstfloor.org> wrote:
> > >> 
> > >> > From: Andi Kleen <ak@linux.intel.com>
> > >> >
> > >> > The menu cpuidle governor does at least two int_sqrt() each time
> > >> > we go into idle in get_typical_interval to compute stddev
> > >> >
> > >> > int_sqrts take 100-120 cycles each. Short idle latency is important
> > >> > for many workloads.
> > >> >
> > >> 
> > >> If you want to optimize get_typical_interval(), why not just take the
> > >> square root out of the equation (literally)?
> > >> 
> > >> Something like
> > >
> > > Looks good. Yes that's a better fix.
> > >
> > 
> > Andi, did you have a way to measure the impact, and if so, could I get
> > you to run the numbers again with my patch?
> 
> I got the numbers from the 0day runs (AIM7 gets faster)
> In theory if you post the patch that should happen automatically
> (checking with Fengguang)

Yes we bisect and report a lot of runtime performance changes.
On the other hand, some few will be missed, too, due to various
unstableness issues. Anyway if you would like to study performance
impacts of a patch, feel free to send us requests to gather the
numbers and do comparison.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2016-02-02  0:36       ` Eric Dumazet
  2016-02-02 20:46         ` Rasmus Villemoes
@ 2017-07-20 10:10         ` Peter Zijlstra
  2017-07-24 13:28           ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2017-07-20 10:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rasmus Villemoes, Andi Kleen, Andi Kleen, akpm, linux-kernel,
	davidlohr.bueso, rafael.j.wysocki, lenb

On Mon, Feb 01, 2016 at 04:36:38PM -0800, Eric Dumazet wrote:
> On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:
> 
> > Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> > 32x32->64 multiplication?)
> 
> If avg is 32bit, compiler does that for you.
> 
> u32 avg = ...
> 
> u64 result = (u64)avg * avg;

It does not in fact do that :/ See commit:

  9e3d6223d209 ("math64, timers: Fix 32bit mul_u64_u32_shr() and friends")

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Optimize int_sqrt for small values for faster idle
  2017-07-20 10:10         ` Peter Zijlstra
@ 2017-07-24 13:28           ` Eric Dumazet
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2017-07-24 13:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rasmus Villemoes, Andi Kleen, Andi Kleen, akpm, linux-kernel,
	davidlohr.bueso, rafael.j.wysocki, lenb

On Thu, 2017-07-20 at 12:10 +0200, Peter Zijlstra wrote:
> On Mon, Feb 01, 2016 at 04:36:38PM -0800, Eric Dumazet wrote:
> > On Tue, 2016-02-02 at 00:08 +0100, Rasmus Villemoes wrote:
> > 
> > > Thanks. (Is there a good way to tell gcc that avg*avg is actually a
> > > 32x32->64 multiplication?)
> > 
> > If avg is 32bit, compiler does that for you.
> > 
> > u32 avg = ...
> > 
> > u64 result = (u64)avg * avg;
> 
> It does not in fact do that :/ See commit:
> 
>   9e3d6223d209 ("math64, timers: Fix 32bit mul_u64_u32_shr() and friends")

Interesting :/

Thanks for letting me know !

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-07-24 13:28 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-28 21:42 [PATCH] Optimize int_sqrt for small values for faster idle Andi Kleen
2016-01-28 22:04 ` kbuild test robot
2016-01-28 22:11 ` kbuild test robot
2016-01-28 22:15 ` Joe Perches
2016-01-28 22:40   ` Andi Kleen
2016-01-28 22:22 ` Joe Perches
2016-01-28 22:29 ` Eric Dumazet
2016-01-28 22:30 ` Andi Kleen
2016-01-29  3:59 ` Rafael J. Wysocki
2016-01-31  7:27 ` Thomas Rohwer
2016-02-01 21:25 ` Rasmus Villemoes
2016-02-01 21:36   ` Andi Kleen
2016-02-01 23:08     ` Rasmus Villemoes
2016-02-02  0:00       ` Andi Kleen
2016-02-02  0:36       ` Eric Dumazet
2016-02-02 20:46         ` Rasmus Villemoes
2016-02-02 21:30           ` Eric Dumazet
2017-07-20 10:10         ` Peter Zijlstra
2017-07-24 13:28           ` Eric Dumazet
2016-02-07 21:32     ` Rasmus Villemoes
2016-02-09 20:44       ` Andi Kleen
2016-02-10 13:31         ` Fengguang Wu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.