linux-kernel.vger.kernel.org archive mirror
* [PATCH] optimize ktime_divns for constant divisors
@ 2014-12-03 19:43 Nicolas Pitre
  2014-12-03 20:03 ` Arnd Bergmann
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Nicolas Pitre @ 2014-12-03 19:43 UTC (permalink / raw)
  To: Thomas Gleixner, John Stultz; +Cc: linux-arm-kernel, linux-kernel

At least on ARM, do_div() is optimized to turn constant divisors into
an inline multiplication by the reciprocal value at compile time. 
However this optimization is missed entirely whenever ktime_divns() is
used and the slow out-of-line division code is used all the time.

Let ktime_divns() use do_div() inline whenever the divisor is constant
and small enough.  This will make things like ktime_to_us() and 
ktime_to_ms() much faster.

Signed-off-by: Nicolas Pitre <nico@linaro.org>

diff --git a/include/linux/ktime.h b/include/linux/ktime.h
index c9d645ad98..411dd8bfe5 100644
--- a/include/linux/ktime.h
+++ b/include/linux/ktime.h
@@ -166,7 +166,17 @@ static inline bool ktime_before(const ktime_t cmp1, const ktime_t cmp2)
 }
 
 #if BITS_PER_LONG < 64
-extern u64 ktime_divns(const ktime_t kt, s64 div);
+extern u64 __ktime_divns(const ktime_t kt, s64 div);
+static inline u64 ktime_divns(const ktime_t kt, s64 div)
+{
+	if (__builtin_constant_p(div) && !(div >> 32)) {
+		u64 ns = kt.tv64;
+		do_div(ns, div);
+		return ns;
+	} else {
+		return __ktime_divns(kt, div);
+	}
+}
 #else /* BITS_PER_LONG < 64 */
 # define ktime_divns(kt, div)		(u64)((kt).tv64 / (div))
 #endif
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 37e50aadd4..890535c41c 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -266,7 +266,7 @@ lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
 /*
  * Divide a ktime value by a nanosecond value
  */
-u64 ktime_divns(const ktime_t kt, s64 div)
+u64 __ktime_divns(const ktime_t kt, s64 div)
 {
 	u64 dclc;
 	int sft = 0;
@@ -282,7 +282,7 @@ u64 ktime_divns(const ktime_t kt, s64 div)
 
 	return dclc;
 }
-EXPORT_SYMBOL_GPL(ktime_divns);
+EXPORT_SYMBOL_GPL(__ktime_divns);
 #endif /* BITS_PER_LONG >= 64 */
 
 /*


* Re: [PATCH] optimize ktime_divns for constant divisors
  2014-12-03 19:43 [PATCH] optimize ktime_divns for constant divisors Nicolas Pitre
@ 2014-12-03 20:03 ` Arnd Bergmann
  2014-12-04  7:23   ` Nicolas Pitre
  2014-12-03 20:16 ` Robert Jarzmik
  2014-12-05 18:00 ` Nicolas Pitre
  2 siblings, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2014-12-03 20:03 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Nicolas Pitre, Thomas Gleixner, John Stultz, linux-kernel

On Wednesday 03 December 2014 14:43:06 Nicolas Pitre wrote:
> At least on ARM, do_div() is optimized to turn constant divisors into
> an inline multiplication by the reciprocal value at compile time. 
> However this optimization is missed entirely whenever ktime_divns() is
> used and the slow out-of-line division code is used all the time.
> 
> Let ktime_divns() use do_div() inline whenever the divisor is constant
> and small enough.  This will make things like ktime_to_us() and 
> ktime_to_ms() much faster.
> 
> Signed-off-by: Nicolas Pitre <nico@linaro.org>

Very cool. I've been thinking about doing something similar for the
general case but couldn't get the math to work.

Can you think of an architecture-independent way to implement
ktime_to_sec, ktime_to_ms, and ktime_to_us efficiently, based on what
you did for the ARM do_div implementation?

	Arnd


* Re: [PATCH] optimize ktime_divns for constant divisors
  2014-12-03 19:43 [PATCH] optimize ktime_divns for constant divisors Nicolas Pitre
  2014-12-03 20:03 ` Arnd Bergmann
@ 2014-12-03 20:16 ` Robert Jarzmik
  2014-12-03 20:37   ` Nicolas Pitre
  2014-12-05 18:00 ` Nicolas Pitre
  2 siblings, 1 reply; 15+ messages in thread
From: Robert Jarzmik @ 2014-12-03 20:16 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Thomas Gleixner, John Stultz, linux-kernel, linux-arm-kernel

Nicolas Pitre <nicolas.pitre@linaro.org> writes:

> Let ktime_divns() use do_div() inline whenever the divisor is constant
> and small enough.  This will make things like ktime_to_us() and 
> ktime_to_ms() much faster.

Hi Nicolas,

I suppose the "small enough" is linked to the "!(div >> 32)" in your patch.  Can
you give the rationale behind this value, and say whether that value is universal
across architectures (i.e. x86/ppc/arm/...)?

And when you say "much faster", do you have figures to add to your commit
message?

Cheers.

--
Robert


* Re: [PATCH] optimize ktime_divns for constant divisors
  2014-12-03 20:16 ` Robert Jarzmik
@ 2014-12-03 20:37   ` Nicolas Pitre
  0 siblings, 0 replies; 15+ messages in thread
From: Nicolas Pitre @ 2014-12-03 20:37 UTC (permalink / raw)
  To: Robert Jarzmik
  Cc: Thomas Gleixner, John Stultz, linux-kernel, linux-arm-kernel

On Wed, 3 Dec 2014, Robert Jarzmik wrote:

> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> 
> > Let ktime_divns() use do_div() inline whenever the divisor is constant
> > and small enough.  This will make things like ktime_to_us() and 
> > ktime_to_ms() much faster.
> 
> Hi Nicolas,
> 
> I suppose the "small enough" is linked to the "!(div >> 32)" in your patch.  Can
> you give the rationale behind this value, and say whether that value is universal
> across architectures (i.e. x86/ppc/arm/...)?

Yes.  The do_div() function is defined to accept a 32-bit divisor only.  
The out-of-line ktime_divns code does scale down both the dividend and 
the divisor until the divisor is within 32 bits of magnitude before 
calling do_div().  However the constness of the divisor is lost and the 
optimised do_div (on ARM at least) doesn't get involved.
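For reference, the out-of-line slow path being described has roughly this shape (a self-contained sketch; in the kernel the final division is do_div(), not plain C division):

```c
#include <stdint.h>

/* Rough shape of the out-of-line slow path in kernel/time/hrtimer.c:
 * scale the dividend and divisor down together until the divisor fits
 * in 32 bits, then do one 64-by-32 division.  Plain C division stands
 * in for the kernel's do_div() so the sketch is self-contained. */
static uint64_t ktime_divns_sketch(int64_t ns, int64_t div)
{
	uint64_t dclc = (uint64_t)ns;
	int sft = 0;

	while (div >> 32) {		/* reduce divisor below 2^32 */
		sft++;
		div >>= 1;
	}
	dclc >>= sft;			/* scale the dividend the same way */
	dclc /= (uint32_t)div;		/* do_div(dclc, div) in the kernel */
	return dclc;
}
```

Because div is a runtime variable by the time the division happens, the compiler has no constant to build a reciprocal from.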

> And when you say "much faster", do you have figures to add to your commit
> message?

No actual figure.  But a wild guess would be around an order of 
magnitude.  See commit fa4adc6149 for an example of generated code.


Nicolas


* Re: [PATCH] optimize ktime_divns for constant divisors
  2014-12-03 20:03 ` Arnd Bergmann
@ 2014-12-04  7:23   ` Nicolas Pitre
  2014-12-04 12:02     ` Arnd Bergmann
       [not found]     ` <OF0EDEDB1C.C03829F7-ON48257DA5.00062083-48257DA5.0007628B@zte.com.cn>
  0 siblings, 2 replies; 15+ messages in thread
From: Nicolas Pitre @ 2014-12-04  7:23 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arm-kernel, Thomas Gleixner, John Stultz, linux-kernel

On Wed, 3 Dec 2014, Arnd Bergmann wrote:

> On Wednesday 03 December 2014 14:43:06 Nicolas Pitre wrote:
> > At least on ARM, do_div() is optimized to turn constant divisors into
> > an inline multiplication by the reciprocal value at compile time. 
> > However this optimization is missed entirely whenever ktime_divns() is
> > used and the slow out-of-line division code is used all the time.
> > 
> > Let ktime_divns() use do_div() inline whenever the divisor is constant
> > and small enough.  This will make things like ktime_to_us() and 
> > ktime_to_ms() much faster.
> > 
> > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> 
> Very cool. I've been thinking about doing something similar for the
> general case but couldn't get the math to work.
> 
> Can you think of an architecture-independent way to implement
> ktime_to_sec, ktime_to_ms, and ktime_to_us efficiently, based on what
> you did for the ARM do_div implementation?

Sure.  gcc generates rather shitty code on ARM compared to the output 
from my do_div() implementation. But here it is:

u64 ktime_to_us(ktime_t kt)
{
	u64 ns = ktime_to_ns(kt);
	u32 x_lo, x_hi, y_lo, y_hi;
	u64 res, carry;

	x_hi = ns >> 32;
	x_lo = ns;
	y_hi = 0x83126e97;
	y_lo = 0x8d4fdf3b;

	res = (u64)x_lo * y_lo;
	carry = (u64)(u32)res + y_lo;
	res = (res >> 32) + (carry >> 32);

	res += (u64)x_lo * y_hi;
	carry = (u64)(u32)res + (u64)x_hi * y_lo;
	res = (res >> 32) + (carry >> 32);

	res += (u64)x_hi * y_hi;
	return res >> 9;
}

For ktime_to_ms() the constants would be as follows:

	y_hi = 0x8637bd05;
	y_lo = 0xaf6c69b5;
	final shift = 19

For ktime_to_sec() that would be:

	y_hi = 0x89705f41;
	y_lo = 0x36b4a597;
	final shift = 29
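All three constant pairs follow one pattern: each is the 64-bit fixed-point reciprocal floor(2^(64+shift)/divisor), split into 32-bit halves. A hypothetical helper (not from the thread, and relying on the GCC/Clang `__int128` extension) to check such constants:

```c
#include <stdint.h>

/* floor(2^(64+shift) / div): the scaled reciprocal behind each pair
 * above, returned as one 64-bit value (y_hi in the top half, y_lo in
 * the bottom half).  Uses the GCC/Clang __int128 extension. */
static uint64_t reciprocal64(uint64_t div, unsigned int shift)
{
	return (uint64_t)(((unsigned __int128)1 << (64 + shift)) / div);
}
```

For instance, reciprocal64(1000000, 19) comes out as 0x8637bd05af6c69b5, matching the ns-to-ms pair, and reciprocal64(1000000000, 29) gives 0x89705f4136b4a597 for ns-to-sec.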


Nicolas


* Re: [PATCH] optimize ktime_divns for constant divisors
  2014-12-04  7:23   ` Nicolas Pitre
@ 2014-12-04 12:02     ` Arnd Bergmann
  2014-12-04 13:46       ` Nicolas Pitre
       [not found]     ` <OF0EDEDB1C.C03829F7-ON48257DA5.00062083-48257DA5.0007628B@zte.com.cn>
  1 sibling, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2014-12-04 12:02 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: linux-arm-kernel, Thomas Gleixner, John Stultz, linux-kernel

On Thursday 04 December 2014 02:23:37 Nicolas Pitre wrote:
> On Wed, 3 Dec 2014, Arnd Bergmann wrote:
> 
> > On Wednesday 03 December 2014 14:43:06 Nicolas Pitre wrote:
> > > At least on ARM, do_div() is optimized to turn constant divisors into
> > > an inline multiplication by the reciprocal value at compile time. 
> > > However this optimization is missed entirely whenever ktime_divns() is
> > > used and the slow out-of-line division code is used all the time.
> > > 
> > > Let ktime_divns() use do_div() inline whenever the divisor is constant
> > > and small enough.  This will make things like ktime_to_us() and 
> > > ktime_to_ms() much faster.
> > > 
> > > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > 
> > Very cool. I've been thinking about doing something similar for the
> > general case but couldn't get the math to work.
> > 
> > Can you think of an architecture-independent way to implement
> > ktime_to_sec, ktime_to_ms, and ktime_to_us efficiently, based on what
> > you did for the ARM do_div implementation?
> 
> Sure.  gcc generates rather shitty code on ARM compared to the output 
> from my do_div() implementation. But here it is:
> 
> u64 ktime_to_us(ktime_t kt)
> {
>         u64 ns = ktime_to_ns(kt);
>         u32 x_lo, x_hi, y_lo, y_hi;
>         u64 res, carry;
> 
>         x_hi = ns >> 32;
>         x_lo = ns;
>         y_hi = 0x83126e97;
>         y_lo = 0x8d4fdf3b;
> 
>         res = (u64)x_lo * y_lo;
>         carry = (u64)(u32)res + y_lo;
>         res = (res >> 32) + (carry >> 32);
> 
>         res += (u64)x_lo * y_hi;
>         carry = (u64)(u32)res + (u64)x_hi * y_lo;
>         res = (res >> 32) + (carry >> 32);
> 
>         res += (u64)x_hi * y_hi;
>         return res >> 9;
> }

Ok, I see, thanks for the example. I also tried this on x86, and it takes
about twice as long as do_div on my Opteron, so it wouldn't be as helpful
as I hoped.

On a related note, I wonder if we can come up with a more efficient
implementation for do_div on ARMv7ve, and I think we should add the
Makefile logic to build with -march=armv7ve when we know that we do
not need to support processors without idiv.

	Arnd


* Re: [PATCH] optimize ktime_divns for constant divisors
  2014-12-04 12:02     ` Arnd Bergmann
@ 2014-12-04 13:46       ` Nicolas Pitre
  2014-12-04 14:56         ` Arnd Bergmann
  0 siblings, 1 reply; 15+ messages in thread
From: Nicolas Pitre @ 2014-12-04 13:46 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arm-kernel, Thomas Gleixner, John Stultz, linux-kernel

On Thu, 4 Dec 2014, Arnd Bergmann wrote:

> On Thursday 04 December 2014 02:23:37 Nicolas Pitre wrote:
> > On Wed, 3 Dec 2014, Arnd Bergmann wrote:
> > 
> > > On Wednesday 03 December 2014 14:43:06 Nicolas Pitre wrote:
> > > > At least on ARM, do_div() is optimized to turn constant divisors into
> > > > an inline multiplication by the reciprocal value at compile time. 
> > > > However this optimization is missed entirely whenever ktime_divns() is
> > > > used and the slow out-of-line division code is used all the time.
> > > > 
> > > > Let ktime_divns() use do_div() inline whenever the divisor is constant
> > > > and small enough.  This will make things like ktime_to_us() and 
> > > > ktime_to_ms() much faster.
> > > > 
> > > > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > > 
> > > Very cool. I've been thinking about doing something similar for the
> > > general case but couldn't get the math to work.
> > > 
> > > Can you think of an architecture-independent way to implement
> > > ktime_to_sec, ktime_to_ms, and ktime_to_us efficiently, based on what
> > > you did for the ARM do_div implementation?
> > 
> > Sure.  gcc generates rather shitty code on ARM compared to the output 
> > from my do_div() implementation. But here it is:
> > 
> > u64 ktime_to_us(ktime_t kt)
> > {
> >         u64 ns = ktime_to_ns(kt);
> >         u32 x_lo, x_hi, y_lo, y_hi;
> >         u64 res, carry;
> > 
> >         x_hi = ns >> 32;
> >         x_lo = ns;
> >         y_hi = 0x83126e97;
> >         y_lo = 0x8d4fdf3b;
> > 
> >         res = (u64)x_lo * y_lo;
> >         carry = (u64)(u32)res + y_lo;
> >         res = (res >> 32) + (carry >> 32);
> > 
> >         res += (u64)x_lo * y_hi;
> >         carry = (u64)(u32)res + (u64)x_hi * y_lo;
> >         res = (res >> 32) + (carry >> 32);
> > 
> >         res += (u64)x_hi * y_hi;
> >         return res >> 9;
> > }
> 
> Ok, I see, thanks for the example. I also tried this on x86, and it takes
> about twice as long as do_div on my Opteron, so it wouldn't be as helpful
> as I hoped.

Note the above code is for 32-bit architectures that support a 32x32=64
bit multiply instruction.  And even then, what kills performance is the
inability to efficiently deal with carry bits from C code.  Hence the
far better output from do_div() on ARM.

If x86-64 has a 64x64=128 bit multiply instruction then the above may
be greatly simplified to a single multiply and a shift.  That would
possibly outperform do_div().
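For illustration, on a compiler that exposes a 64x64->128 multiply (the GCC/Clang `__int128` extension on 64-bit targets), the whole ns-to-us conversion can indeed collapse to one multiply and one shift. A hypothetical sketch, with the reciprocal taken as floor(2^73/1000) plus a small rounding term to offset its truncation:

```c
#include <stdint.h>

/* Hypothetical sketch: ns/1000 for 0 <= ns < 2^63 as one multiply-high
 * plus a shift, given a 64x64->128 multiply (__int128 on x86-64).
 * 0x83126E978D4FDF3B is floor(2^73 / 1000); the extra "+ y" offsets
 * the truncation of the reciprocal so the result matches /1000 exactly
 * over that range. */
static uint64_t ns_to_us_128(uint64_t ns)
{
	const unsigned __int128 y = 0x83126E978D4FDF3BULL;

	return (uint64_t)((ns * y + y) >> 73);
}
```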

> On a related note, I wonder if we can come up with a more efficient
> implementation for do_div on ARMv7ve, and I think we should add the
> Makefile logic to build with -march=armv7ve when we know that we do
> not need to support processors without idiv.

Multiplications will always be faster than divisions.  However the idiv
instruction would come in very handy in the slow path when the divisor is
not constant.


Nicolas


* Re: [PATCH] optimize ktime_divns for constant divisors
  2014-12-04 13:46       ` Nicolas Pitre
@ 2014-12-04 14:56         ` Arnd Bergmann
  2014-12-04 16:47           ` Nicolas Pitre
  0 siblings, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2014-12-04 14:56 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: linux-arm-kernel, Thomas Gleixner, John Stultz, linux-kernel

On Thursday 04 December 2014 08:46:27 Nicolas Pitre wrote:
> On Thu, 4 Dec 2014, Arnd Bergmann wrote:
> Note the above code is for 32-bit architectures that support a 32x32=64
> bit multiply instruction.  And even then, what kills performance is the
> inability to efficiently deal with carry bits from C code.  Hence the
> far better output from do_div() on ARM.
> 
> If x86-64 has a 64x64=128 bit multiply instruction then the above may
> be greatly simplified to a single multiply and a shift.  That would
> possibly outperform do_div().

I was trying this in 32-bit mode to see how it would work in x86-32
kernels. Since that architecture has a 64-by-32 divide instruction,
that gets used here.

x86-64 has a 64x64=128 multiply instruction and gcc uses that for
any 64-bit division by constant, so that's what already happens
in do_div. I assume for any 64-bit architecture, the result will
be similar.

I guess the only architectures that would benefit from your implementation
above are the ones that do not have any optimization for constant
64-by-32-bit division and just call do_div.

> > On a related note, I wonder if we can come up with a more efficient
> > implementation for do_div on ARMv7ve, and I think we should add the
> > Makefile logic to build with -march=armv7ve when we know that we do
> > not need to support processors without idiv.
> 
> > Multiplications will always be faster than divisions.  However the idiv
> > instruction would come in very handy in the slow path when the divisor is
> > not constant.

Makes sense. I also just checked the gcc sources and it seems that the
idiv/udiv instructions on ARM are not even used for implementing
__aeabi_uldivmod there. Not sure if that's intentional, but we probably
don't need to bother optimizing this in the kernel before user space
does. Building with -march=armv7ve still sounds helpful to avoid the
__aeabi_uidiv calls though.

	Arnd


* Re: [PATCH] optimize ktime_divns for constant divisors
  2014-12-04 14:56         ` Arnd Bergmann
@ 2014-12-04 16:47           ` Nicolas Pitre
  0 siblings, 0 replies; 15+ messages in thread
From: Nicolas Pitre @ 2014-12-04 16:47 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arm-kernel, Thomas Gleixner, John Stultz, linux-kernel

On Thu, 4 Dec 2014, Arnd Bergmann wrote:

> On Thursday 04 December 2014 08:46:27 Nicolas Pitre wrote:
> > On Thu, 4 Dec 2014, Arnd Bergmann wrote:
> > Note the above code is for 32-bit architectures that support a 32x32=64
> > bit multiply instruction.  And even then, what kills performance is the
> > inability to efficiently deal with carry bits from C code.  Hence the
> > far better output from do_div() on ARM.
> > 
> > If x86-64 has a 64x64=128 bit multiply instruction then the above may
> > be greatly simplified to a single multiply and a shift.  That would
> > possibly outperform do_div().
> 
> I was trying this in 32-bit mode to see how it would work in x86-32
> kernels. Since that architecture has a 64-by-32 divide instruction,
> that gets used here.
> 
> x86-64 has a 64x64=128 multiply instruction and gcc uses that for
> any 64-bit division by constant, so that's what already happens
> in do_div. I assume for any 64-bit architecture, the result will
> be similar.

OK.  In that case x86-64 will also benefit from the patch at the 
beginning of this thread.

> I guess the only architectures that would benefit from your implementation
> above are the ones that do not have any optimization for constant
> 64-by-32-bit division and just call do_div.

And then it would be best to optimize do_div() directly so all users 
would benefit.

> > > On a related note, I wonder if we can come up with a more efficient
> > > implementation for do_div on ARMv7ve, and I think we should add the
> > > Makefile logic to build with -march=armv7ve when we know that we do
> > > not need to support processors without idiv.
> > 
> > Multiplications will always be faster than divisions.  However the idiv
> > instruction would come in very handy in the slow path when the divisor is
> > not constant.
> 
> Makes sense. I also just checked the gcc sources and it seems that the
> idiv/udiv instructions on ARM are not even used for implementing
> __aeabi_uldivmod there. Not sure if that's intentional, but we probably
> don't need to bother optimizing this in the kernel before user space
> does.

I wouldn't say so.  There are many precedents where we optimized those 
things in the kernel before gcc caught up.  In a few cases I contributed 
the same optimized arithmetic routines to both gcc and the kernel.

> Building with -march=armv7ve still sounds helpful to avoid the
> __aeabi_uidiv calls though.

Yep.


Nicolas


* Re: Re: [PATCH] optimize ktime_divns for constant divisors
       [not found]     ` <OF0EDEDB1C.C03829F7-ON48257DA5.00062083-48257DA5.0007628B@zte.com.cn>
@ 2014-12-05  4:30       ` Nicolas Pitre
  2014-12-05 10:08         ` Arnd Bergmann
  0 siblings, 1 reply; 15+ messages in thread
From: Nicolas Pitre @ 2014-12-05  4:30 UTC (permalink / raw)
  To: pang.xunlei
  Cc: Arnd Bergmann, John Stultz, linux-arm-kernel, linux-kernel,
	linux-kernel-owner, Thomas Gleixner

On Fri, 5 Dec 2014, pang.xunlei@zte.com.cn wrote:

> Nicolas,
> 
On Thursday 04 December 2014 15:23:37, Nicolas Pitre wrote:
> > Nicolas Pitre <nicolas.pitre@linaro.org> 
> > 
> > u64 ktime_to_us(ktime_t kt)
> > {
> >    u64 ns = ktime_to_ns(kt);
> >    u32 x_lo, x_hi, y_lo, y_hi;
> >    u64 res, carry;
> > 
> >    x_hi = ns >> 32;
> >    x_lo = ns;
> >    y_hi = 0x83126e97;
> >    y_lo = 0x8d4fdf3b;
> > 
> >    res = (u64)x_lo * y_lo;
> >    carry = (u64)(u32)res + y_lo;
> >    res = (res >> 32) + (carry >> 32);
> > 
> >    res += (u64)x_lo * y_hi;
> >    carry = (u64)(u32)res + (u64)x_hi * y_lo;
> >    res = (res >> 32) + (carry >> 32);
> > 
> >    res += (u64)x_hi * y_hi;
> >    return res >> 9;
> > }
> 
> What's the first carry operation for?

Hmmm... OK there is a bug.

The code should actually be:

	res = (u64)x_lo * y_lo;
	carry = (u64)(u32)res + y_lo;
	res = (res >> 32) + (carry >> 32) + y_hi;

(the addition of y_hi was missing on the third line of the above block)

The equation is: res = (y + x*y) >> 9

> Moreover, I think the carry operations can be omitted, like below:
> u64 ktime_to_us(ktime_t kt)
> {
>          u64 ns = ktime_to_ns(kt);
>          u32 x_lo, x_hi, y_lo, y_hi;
>          u64 res;
> 
>          x_hi = ns >> 32;
>          x_lo = ns;
>          y_hi = 0x83126e97;
>          y_lo = 0x8d4fdf3b;
> 
>          res = (u64)x_lo * y_lo;
>          res = (res >> 32);

See above. y must be added to res before shifting, and that may cause an 
overflow.

>          res += (u64)x_lo * y_hi + (u64)x_hi * y_lo;

That, too, risks overflowing.

Let's say x_lo = 0xffffffff and x_hi = 0xffffffff.  You get:

	0xffffffff * 0x83126e97 ->  0x83126e967ced9169
	0xffffffff * 0x8d4fdf3b ->  0x8d4fdf3a72b020c5
	                           -------------------
	                           0x110624dd0ef9db22e

Therefore the sum doesn't fit into a u64 variable.

It is possible to skip carry handling, but only when the MSB of each
constant is zero.  Here that is not the case.
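A quick way to see the wrap-around (a hypothetical check, not from the thread):

```c
#include <stdint.h>

/* With x_lo = x_hi = 0xffffffff, the two 32x32->64 cross products can
 * sum to a 65-bit value; adding them in a u64 without carry handling
 * then silently wraps, which is what the hex sum above demonstrates. */
static int cross_sum_wraps(uint32_t y_hi, uint32_t y_lo)
{
	uint64_t a = 0xffffffffULL * y_hi;
	uint64_t b = 0xffffffffULL * y_lo;

	return a + b < a;	/* true iff the 64-bit sum wrapped */
}
```

The 0x83126e97/0x8d4fdf3b pair wraps; a pair with both MSBs clear cannot, since each product then stays below 2^63.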

>          res = (res >> 32);
> 
>          res += (u64)x_hi * y_hi;
> 
>          return res >> 9;
> }
> 
> Also, I ran this code using ktime "122500000000", and it results in
> 122499999 due to the y_lo deviation,

Please see bug fix above.

> maybe can use 0x8d4fdf3c instead?

No, that won't work with 0xfffffffffffffd97 for example.


Nicolas


* Re: [PATCH] optimize ktime_divns for constant divisors
  2014-12-05  4:30       ` Nicolas Pitre
@ 2014-12-05 10:08         ` Arnd Bergmann
  2014-12-05 17:15           ` Nicolas Pitre
  0 siblings, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2014-12-05 10:08 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Nicolas Pitre, pang.xunlei, linux-kernel-owner, linux-kernel,
	John Stultz, Thomas Gleixner

On Thursday 04 December 2014 23:30:08 Nicolas Pitre wrote:
> >          res += (u64)x_lo * y_hi + (u64)x_hi * y_lo;
> 
> That, too, risks overflowing.
> 
> Let's say x_lo = 0xffffffff and x_hi = 0xffffffff.  You get:
> 
>         0xffffffff * 0x83126e97 ->  0x83126e967ced9169
>         0xffffffff * 0x8d4fdf3b ->  0x8d4fdf3a72b020c5
>                                    -------------------
>                                    0x110624dd0ef9db22e
> 
> Therefore the sum doesn't fit into a u64 variable.
> 
> It is possible to skip carry handling, but only when the MSB of each
> constant is zero.  Here that is not the case.

If I understand this right, there are two possible optimizations to
avoid the overflow:

- for anything using monotonic time, or elapsed time, we can guarantee
  that the upper bits are zero. Relying on monotonic time is a bit
  dangerous, because that would mean introducing an API that works
  with ktime_get() but not ktime_get_real(), and risk introducing
  subtle bugs.
  However, ktime_us_delta() could be optimized, and we can introduce
  similar ktime_sec_delta() and ktime_ms_delta() functions with
  the same properties.

- one could always pre-shift the ktime_t value. For a division by
  1000, we can shift right by 3 bits first, then do the multiplication
  and then do another shift. Not sure if that helps at all or if
  the extra shift operation makes this counterproductive.

	Arnd


* Re: [PATCH] optimize ktime_divns for constant divisors
  2014-12-05 10:08         ` Arnd Bergmann
@ 2014-12-05 17:15           ` Nicolas Pitre
  0 siblings, 0 replies; 15+ messages in thread
From: Nicolas Pitre @ 2014-12-05 17:15 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arm-kernel, pang.xunlei, linux-kernel-owner, linux-kernel,
	John Stultz, Thomas Gleixner

On Fri, 5 Dec 2014, Arnd Bergmann wrote:


> > 
> > That, too, risks overflowing.
> > 
> > Let's say x_lo = 0xffffffff and x_hi = 0xffffffff.  You get:
> > 
> >         0xffffffff * 0x83126e97 ->  0x83126e967ced9169
> >         0xffffffff * 0x8d4fdf3b ->  0x8d4fdf3a72b020c5
> >                                    -------------------
> >                                    0x110624dd0ef9db22e
> > 
> > Therefore the sum doesn't fit into a u64 variable.
> > 
> > It is possible to skip carry handling, but only when the MSB of each
> > constant is zero.  Here that is not the case.
> 
> If I understand this right, there are two possible optimizations to
> avoid the overflow:
> 
> - for anything using monotonic time, or elapsed time, we can guarantee
>   that the upper bits are zero. Relying on monotonic time is a bit
>   dangerous, because that would mean introducing an API that works
>   with ktime_get() but not ktime_get_real(), and risk introducing
>   subtle bugs.
>   However, ktime_us_delta() could be optimized, and we can introduce
>   similar ktime_sec_delta() and ktime_ms_delta() functions with
>   the same properties.

Well, as Pang mentioned, ktime_t.tv64 is signed.  So if a negative value 
were to be passed to the current version of ktime_divns() you wouldn't 
get the expected answer as the first thing it does is

	u64 dclc = ktime_to_ns(kt);

And do_div() works with unsigned values.

That is to say, we can assume that currently, and for the foreseeable
future, the top bit of ktime_t will never be set.  And if it ever is,
due to a negative value, then the code is already buggy.

With that assumption in mind, we now have a maximum value of
0x7fffffffffffffff to divide, i.e. 63 rather than 64 bits.  That means we
don't need the initial bias anymore to get correct results.  And the
constant loses its MSB too, removing the possibility of overflows in
the cross products.

Therefore the code becomes:

u64 ktime_to_us(ktime_t kt)
{
        u64 ns = ktime_to_ns(kt);
        u32 x_lo, x_hi, y_lo, y_hi;
        u64 res;

        x_hi = ns >> 32;
        x_lo = ns;
        y_hi = 0x4189374b;
        y_lo = 0xc6a7ef9e;

        res =  (u64)x_lo * y_lo;
        res >>= 32;
        res += (u64)x_lo * y_hi;
        res += (u64)x_hi * y_lo;
        res >>= 32;
        res += (u64)x_hi * y_hi;

        return res >> 8;
}

This is probably the best that can be done portably.
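For what it's worth, the routine can be spot-checked against plain 64-bit division with a small stand-alone harness (hypothetical, not part of the thread):

```c
#include <stdint.h>

/* Same computation as the message above, taking the raw nanosecond
 * value directly in place of ktime_to_ns(kt). */
static uint64_t ns_to_us_portable(uint64_t ns)
{
	uint32_t x_lo = (uint32_t)ns, x_hi = (uint32_t)(ns >> 32);
	uint32_t y_hi = 0x4189374b, y_lo = 0xc6a7ef9e;
	uint64_t res;

	res =  (uint64_t)x_lo * y_lo;
	res >>= 32;
	res += (uint64_t)x_lo * y_hi;
	res += (uint64_t)x_hi * y_lo;
	res >>= 32;
	res += (uint64_t)x_hi * y_hi;

	return res >> 8;
}
```

Comparing the result with ns / 1000 over sample values in the 0..2^63-1 range (including exact multiples of 1000) shows agreement, consistent with the reasoning above.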

> - one could always pre-shift the ktime_t value. For a division by
>   1000, we can shift right by 3 bits first, then do the multiplication
>   and then do another shift. Not sure if that helps at all or if
>   the extra shift operation makes this counterproductive.

It could help in the full 64-bit range case as the remaining dividend 
doesn't require a full 64-bit reciprocal constant, avoiding once again 
the need for the initial bias and the carry handling.  Depending on the 
actual reciprocal bit pattern this may not always be necessary.  It also 
depends how cheap shifting a 64-bit value is (on ARM this requires 3 
instructions and 3 registers).

But in the specific case above this provides no gain.


Nicolas



* Re: [PATCH] optimize ktime_divns for constant divisors
  2014-12-03 19:43 [PATCH] optimize ktime_divns for constant divisors Nicolas Pitre
  2014-12-03 20:03 ` Arnd Bergmann
  2014-12-03 20:16 ` Robert Jarzmik
@ 2014-12-05 18:00 ` Nicolas Pitre
  2014-12-05 21:03   ` Arnd Bergmann
  2 siblings, 1 reply; 15+ messages in thread
From: Nicolas Pitre @ 2014-12-05 18:00 UTC (permalink / raw)
  To: Thomas Gleixner, John Stultz; +Cc: linux-arm-kernel, linux-kernel

BTW this is worth applying despite the on-going discussion with Arnd 
on a separate optimization.

On Wed, 3 Dec 2014, Nicolas Pitre wrote:

> At least on ARM, do_div() is optimized to turn constant divisors into
> an inline multiplication by the reciprocal value at compile time. 
> However this optimization is missed entirely whenever ktime_divns() is
> used and the slow out-of-line division code is used all the time.
> 
> Let ktime_divns() use do_div() inline whenever the divisor is constant
> and small enough.  This will make things like ktime_to_us() and 
> ktime_to_ms() much faster.
> 
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> 
> diff --git a/include/linux/ktime.h b/include/linux/ktime.h
> index c9d645ad98..411dd8bfe5 100644
> --- a/include/linux/ktime.h
> +++ b/include/linux/ktime.h
> @@ -166,7 +166,17 @@ static inline bool ktime_before(const ktime_t cmp1, const ktime_t cmp2)
>  }
>  
>  #if BITS_PER_LONG < 64
> -extern u64 ktime_divns(const ktime_t kt, s64 div);
> +extern u64 __ktime_divns(const ktime_t kt, s64 div);
> +static inline u64 ktime_divns(const ktime_t kt, s64 div)
> +{
> +	if (__builtin_constant_p(div) && !(div >> 32)) {
> +		u64 ns = kt.tv64;
> +		do_div(ns, div);
> +		return ns;
> +	} else {
> +		return __ktime_divns(kt, div);
> +	}
> +}
>  #else /* BITS_PER_LONG < 64 */
>  # define ktime_divns(kt, div)		(u64)((kt).tv64 / (div))
>  #endif
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 37e50aadd4..890535c41c 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -266,7 +266,7 @@ lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
>  /*
>   * Divide a ktime value by a nanosecond value
>   */
> -u64 ktime_divns(const ktime_t kt, s64 div)
> +u64 __ktime_divns(const ktime_t kt, s64 div)
>  {
>  	u64 dclc;
>  	int sft = 0;
> @@ -282,7 +282,7 @@ u64 ktime_divns(const ktime_t kt, s64 div)
>  
>  	return dclc;
>  }
> -EXPORT_SYMBOL_GPL(ktime_divns);
> +EXPORT_SYMBOL_GPL(__ktime_divns);
>  #endif /* BITS_PER_LONG >= 64 */
>  
>  /*
> 


* Re: [PATCH] optimize ktime_divns for constant divisors
  2014-12-05 18:00 ` Nicolas Pitre
@ 2014-12-05 21:03   ` Arnd Bergmann
  2014-12-18 21:21     ` John Stultz
  0 siblings, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2014-12-05 21:03 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Nicolas Pitre, Thomas Gleixner, John Stultz, linux-kernel

On Friday 05 December 2014 13:00:22 Nicolas Pitre wrote:
> 
> BTW this is worth applying despite the on-going discussion with Arnd 
> on a separate optimization.

Agreed

> On Wed, 3 Dec 2014, Nicolas Pitre wrote:
> 
> > At least on ARM, do_div() is optimized to turn constant divisors into
> > an inline multiplication by the reciprocal value at compile time. 
> > However this optimization is missed entirely whenever ktime_divns() is
> > used and the slow out-of-line division code is used all the time.
> > 
> > Let ktime_divns() use do_div() inline whenever the divisor is constant
> > and small enough.  This will make things like ktime_to_us() and 
> > ktime_to_ms() much faster.
> > 
> > Signed-off-by: Nicolas Pitre <nico@linaro.org>

Acked-by: Arnd Bergmann <arnd@arndb.de>


* Re: [PATCH] optimize ktime_divns for constant divisors
  2014-12-05 21:03   ` Arnd Bergmann
@ 2014-12-18 21:21     ` John Stultz
  0 siblings, 0 replies; 15+ messages in thread
From: John Stultz @ 2014-12-18 21:21 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Nicolas Pitre, Thomas Gleixner, lkml

On Fri, Dec 5, 2014 at 1:03 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 05 December 2014 13:00:22 Nicolas Pitre wrote:
>>
>> BTW this is worth applying despite the on-going discussion with Arnd
>> on a separate optimization.
>
> Agreed
>
>> On Wed, 3 Dec 2014, Nicolas Pitre wrote:
>>
>> > At least on ARM, do_div() is optimized to turn constant divisors into
>> > an inline multiplication by the reciprocal value at compile time.
>> > However this optimization is missed entirely whenever ktime_divns() is
>> > used and the slow out-of-line division code is used all the time.
>> >
>> > Let ktime_divns() use do_div() inline whenever the divisor is constant
>> > and small enough.  This will make things like ktime_to_us() and
>> > ktime_to_ms() much faster.
>> >
>> > Signed-off-by: Nicolas Pitre <nico@linaro.org>
>
> Acked-by: Arnd Bergmann <arnd@arndb.de>


Ok, I've queued the original patch up for testing.

thanks
-john

