linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits
@ 2016-11-16 16:57 Chris Metcalf
  2016-11-16 18:04 ` John Stultz
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Metcalf @ 2016-11-16 16:57 UTC (permalink / raw)
  To: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner,
	Tony Lindgren, Steven Miao, linux-kernel
  Cc: Chris Metcalf

For large values of "mult" and long uptimes, the intermediate
result of "cycles * mult" can overflow 64 bits.  For example,
the tile platform uses this helper function; for a 1.2 GHz clock,
we have mult = 853, and after 208.5 days, we overflow 64 bits.

The fix is basically the same as the fix for arch/x86 __cycles_2_ns()
in commit 4cecf6d401a0 ("sched, x86: Avoid unnecessary overflow in
sched_clock"), using the new mult_frac() helper.

In addition to tile, arm/plat-omap and blackfin also use this helper
function, so will presumably hit similar issues.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
By the way, this is the bug that I was looking for when I tripped over
the missing bugfix for timekeeping_delta_to_ns() a couple of days ago :-)

 include/linux/clocksource.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 08398182f56e..b2a022acf232 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -175,7 +175,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant)
  */
 static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
 {
-	return ((u64) cycles * mult) >> shift;
+	return mult_frac(cycles, mult, 1ULL << shift);
 }
 
 
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits
  2016-11-16 16:57 [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits Chris Metcalf
@ 2016-11-16 18:04 ` John Stultz
  2016-11-16 19:30   ` Chris Metcalf
  0 siblings, 1 reply; 18+ messages in thread
From: John Stultz @ 2016-11-16 18:04 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren,
	Steven Miao, lkml

On Wed, Nov 16, 2016 at 8:57 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> For large values of "mult" and long uptimes, the intermediate
> result of "cycles * mult" can overflow 64 bits.  For example,
> the tile platform uses this helper function; for a 1.2 GHz clock,
> we have mult = 853, and after 208.5 days, we overflow 64 bits.
>
> The fix is basically the same as the fix for arch/x86 __cycles_2_ns()
> in commit 4cecf6d401a0 ("sched, x86: Avoid unnecessary overflow in
> sched_clock"), using the new mult_frac() helper.
>
> In addition to tile, arm/plat-omap and blackfin also use this helper
> function, so will presumably hit similar issues.
>
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
> ---
> By the way, this is the bug that I was looking for when I tripped over
> the missing bugfix for timekeeping_delta_to_ns() a couple of days ago :-)

Glad you found your bug! :)

>  include/linux/clocksource.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index 08398182f56e..b2a022acf232 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -175,7 +175,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant)
>   */
>  static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
>  {
> -       return ((u64) cycles * mult) >> shift;
> +       return mult_frac(cycles, mult, 1ULL << shift);
>  }


So clocksource_cyc2ns() was never intended to be used with
indefinitely large cycle values, and it looks like tile and blackfin
are abusing the interface (the omap usage provide cycle deltas rather
then just the current counter value).

I'd suggest instead to move tile/blackfin to using the generic
sched_clock implementation that most of the architectures use, or
special case the code in the arch specific sched_clock
implementations(as x86 does) instead of modifying the common interface
to better handle a use case its not intended for.

thanks
-john

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits
  2016-11-16 18:04 ` John Stultz
@ 2016-11-16 19:30   ` Chris Metcalf
  2016-11-16 19:35     ` [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count Chris Metcalf
                       ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Chris Metcalf @ 2016-11-16 19:30 UTC (permalink / raw)
  To: John Stultz
  Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren,
	Steven Miao, lkml

On 11/16/2016 1:04 PM, John Stultz wrote:
> On Wed, Nov 16, 2016 at 8:57 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>   include/linux/clocksource.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
>> index 08398182f56e..b2a022acf232 100644
>> --- a/include/linux/clocksource.h
>> +++ b/include/linux/clocksource.h
>> @@ -175,7 +175,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant)
>>    */
>>   static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
>>   {
>> -       return ((u64) cycles * mult) >> shift;
>> +       return mult_frac(cycles, mult, 1ULL << shift);
>>   }
>
> So clocksource_cyc2ns() was never intended to be used with
> indefinitely large cycle values, and it looks like tile and blackfin
> are abusing the interface (the omap usage provide cycle deltas rather
> then just the current counter value).

Well, the interface does just say "convert clocksource cycles to nanoseconds". :-)
If you think it's more important that it be a little faster, we should adjust the
documentation to say it is only appropriate for delta-cycles, not absolute cycles.
I've appended a commit that does this if you'd like to take it.

> I'd suggest instead to move tile/blackfin to using the generic
> sched_clock implementation that most of the architectures use, or
> special case the code in the arch specific sched_clock
> implementations(as x86 does) instead of modifying the common interface
> to better handle a use case its not intended for.

Yes, since tile has a full 64-bit cycle counter, the best thing is to just directly
open-code the mult_frac() in tile's sched_clock().  I'll push that change.
Steven Miao, I assume you should do the same for blackfin.

Thanks!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

From: Chris Metcalf <cmetcalf@mellanox.com>
Date: Wed, 16 Nov 2016 14:24:53 -0500
Subject: [PATCH] clocksource_cyc2ns: document intended range limitation

The "cycles" argument should not be an absolute clocksource cycle
value, as the implementation's arithmetic will overflow relatively
easily.  For performance, the implementation is simple and fast,
since the function is intended for only relatively small delta
values of clocksource cycles.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
  include/linux/clocksource.h | 4 ++++
  1 file changed, 4 insertions(+)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 08398182f56e..5444429884b8 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -171,6 +171,10 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant)
   *
   * Converts cycles to nanoseconds, using the given mult and shift.
   *
+ * The code is optimized for performance and not intended to work
+ * with absolute clocksource cycles, as it will easily overflow,
+ * but just intended for relative (delta) clocksource cycles.
+ *
   * XXX - This could use some mult_lxl_ll() asm optimization
   */
  static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count
  2016-11-16 19:30   ` Chris Metcalf
@ 2016-11-16 19:35     ` Chris Metcalf
  2016-11-16 19:59       ` John Stultz
  2016-11-16 19:40     ` [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits John Stultz
  2016-11-16 19:45     ` John Stultz
  2 siblings, 1 reply; 18+ messages in thread
From: Chris Metcalf @ 2016-11-16 19:35 UTC (permalink / raw)
  To: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner,
	Tony Lindgren, Steven Miao, linux-kernel
  Cc: Chris Metcalf

For large values of "mult" and long uptimes, the intermediate
result of "cycles * mult" can overflow 64 bits.  For example,
the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock;
we have mult = 853, and after 208.5 days, we overflow 64 bits.

Since clocksource_cyc2ns() is intended to be used for relative
cycle counts, not absolute cycle counts, performance is more
importance than accepting a wider range of cycle values.
So, just use mult_frac() directly in tile's sched_clock().

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
Blackfin should make a similar change in their sched_clock().

 arch/tile/kernel/time.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 178989e6d3e3..ea960d660917 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -218,8 +218,8 @@ void do_timer_interrupt(struct pt_regs *regs, int fault_num)
  */
 unsigned long long sched_clock(void)
 {
-	return clocksource_cyc2ns(get_cycles(),
-				  sched_clock_mult, SCHED_CLOCK_SHIFT);
+	return mult_frac(get_cycles(),
+			 sched_clock_mult, 1ULL << SCHED_CLOCK_SHIFT);
 }
 
 int setup_profiling_timer(unsigned int multiplier)
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits
  2016-11-16 19:30   ` Chris Metcalf
  2016-11-16 19:35     ` [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count Chris Metcalf
@ 2016-11-16 19:40     ` John Stultz
  2016-11-16 19:45     ` John Stultz
  2 siblings, 0 replies; 18+ messages in thread
From: John Stultz @ 2016-11-16 19:40 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren,
	Steven Miao, lkml

On Wed, Nov 16, 2016 at 11:30 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> On 11/16/2016 1:04 PM, John Stultz wrote:
>>
>> On Wed, Nov 16, 2016 at 8:57 AM, Chris Metcalf <cmetcalf@mellanox.com>
>> wrote:
>>>
>>>   include/linux/clocksource.h | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
>>> index 08398182f56e..b2a022acf232 100644
>>> --- a/include/linux/clocksource.h
>>> +++ b/include/linux/clocksource.h
>>> @@ -175,7 +175,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32
>>> shift_constant)
>>>    */
>>>   static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32
>>> shift)
>>>   {
>>> -       return ((u64) cycles * mult) >> shift;
>>> +       return mult_frac(cycles, mult, 1ULL << shift);
>>>   }
>>
>>
>> So clocksource_cyc2ns() was never intended to be used with
>> indefinitely large cycle values, and it looks like tile and blackfin
>> are abusing the interface (the omap usage provide cycle deltas rather
>> then just the current counter value).
>
>
> Well, the interface does just say "convert clocksource cycles to
> nanoseconds". :-)

Right, and I can understand the confusion, but its not being used with
a struct clocksource. Its just being used to convert get_cycles().

> If you think it's more important that it be a little faster, we should
> adjust the
> documentation to say it is only appropriate for delta-cycles, not absolute
> cycles.
> I've appended a commit that does this if you'd like to take it.

That's fair. Thanks for sending that, II'll queue that in my tree here
in a moment.

>> I'd suggest instead to move tile/blackfin to using the generic
>> sched_clock implementation that most of the architectures use, or
>> special case the code in the arch specific sched_clock
>> implementations(as x86 does) instead of modifying the common interface
>> to better handle a use case its not intended for.
>
>
> Yes, since tile has a full 64-bit cycle counter, the best thing is to just
> directly
> open-code the mult_frac() in tile's sched_clock().  I'll push that change.
> Steven Miao, I assume you should do the same for blackfin.

thanks
-john

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits
  2016-11-16 19:30   ` Chris Metcalf
  2016-11-16 19:35     ` [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count Chris Metcalf
  2016-11-16 19:40     ` [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits John Stultz
@ 2016-11-16 19:45     ` John Stultz
  2016-11-16 19:56       ` Chris Metcalf
  2 siblings, 1 reply; 18+ messages in thread
From: John Stultz @ 2016-11-16 19:45 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren,
	Steven Miao, lkml

On Wed, Nov 16, 2016 at 11:30 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index 08398182f56e..5444429884b8 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -171,6 +171,10 @@ static inline u32 clocksource_hz2mult(u32 hz, u32
> shift_constant)
>   *
>   * Converts cycles to nanoseconds, using the given mult and shift.
>   *
> + * The code is optimized for performance and not intended to work
> + * with absolute clocksource cycles, as it will easily overflow,
> + * but just intended for relative (delta) clocksource cycles.
> + *
>   * XXX - This could use some mult_lxl_ll() asm optimization

Just as a heads up, it seems your working against an older kernel, as
this didn't apply. Its simple enough to fix up, so I'll do so, but in
the future, please submit patches against something close to Linus
HEAD.

thanks
-john

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits
  2016-11-16 19:45     ` John Stultz
@ 2016-11-16 19:56       ` Chris Metcalf
  2016-11-16 20:00         ` John Stultz
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Metcalf @ 2016-11-16 19:56 UTC (permalink / raw)
  To: John Stultz
  Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren,
	Steven Miao, lkml

On 11/16/2016 2:45 PM, John Stultz wrote:
> On Wed, Nov 16, 2016 at 11:30 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
>> index 08398182f56e..5444429884b8 100644
>> --- a/include/linux/clocksource.h
>> +++ b/include/linux/clocksource.h
>> @@ -171,6 +171,10 @@ static inline u32 clocksource_hz2mult(u32 hz, u32
>> shift_constant)
>>    *
>>    * Converts cycles to nanoseconds, using the given mult and shift.
>>    *
>> + * The code is optimized for performance and not intended to work
>> + * with absolute clocksource cycles, as it will easily overflow,
>> + * but just intended for relative (delta) clocksource cycles.
>> + *
>>    * XXX - This could use some mult_lxl_ll() asm optimization
> Just as a heads up, it seems your working against an older kernel, as
> this didn't apply. Its simple enough to fix up, so I'll do so, but in
> the future, please submit patches against something close to Linus
> HEAD.

Oops, sorry; it wasn't version skew (I'm at v4.9-rc4) but whitespace damage.
I assumed if I just pasted the patch into Thunderbird it would work, since it had
no tabs.  But bizarrely, if I look at the patch in the mailer, it shows a two-space
prefix, but when I save the email to a file, it has a three-space prefix.  WTF?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count
  2016-11-16 19:35     ` [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count Chris Metcalf
@ 2016-11-16 19:59       ` John Stultz
  2016-11-16 20:16         ` Chris Metcalf
  0 siblings, 1 reply; 18+ messages in thread
From: John Stultz @ 2016-11-16 19:59 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren,
	Steven Miao, lkml

On Wed, Nov 16, 2016 at 11:35 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> For large values of "mult" and long uptimes, the intermediate
> result of "cycles * mult" can overflow 64 bits.  For example,
> the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock;
> we have mult = 853, and after 208.5 days, we overflow 64 bits.
>
> Since clocksource_cyc2ns() is intended to be used for relative
> cycle counts, not absolute cycle counts, performance is more
> importance than accepting a wider range of cycle values.
> So, just use mult_frac() directly in tile's sched_clock().
>
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
> ---
> Blackfin should make a similar change in their sched_clock().
>
>  arch/tile/kernel/time.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
> index 178989e6d3e3..ea960d660917 100644
> --- a/arch/tile/kernel/time.c
> +++ b/arch/tile/kernel/time.c
> @@ -218,8 +218,8 @@ void do_timer_interrupt(struct pt_regs *regs, int fault_num)
>   */
>  unsigned long long sched_clock(void)
>  {
> -       return clocksource_cyc2ns(get_cycles(),
> -                                 sched_clock_mult, SCHED_CLOCK_SHIFT);
> +       return mult_frac(get_cycles(),
> +                        sched_clock_mult, 1ULL << SCHED_CLOCK_SHIFT);
>  }

So... looking closer at mult_frac(), its a really slow implementation,
doing 2 divs and a mod and a mult. Hopefully the compiler can sort out
the divs are power of two, and optimize it out, but I'm still
hesitant.

sched_clock() is normally a very hot-path call, so this might have a
real performance impact, especially compared to what its replacing.

In your earlier patch, you mentioned this was similar to 4cecf6d401a0
("sched, x86: Avoid unnecessary overflow in
sched_clock"). It might be better to actually try to use similar logic
there, to make sure the performance impact is minimal.

thanks
-john

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits
  2016-11-16 19:56       ` Chris Metcalf
@ 2016-11-16 20:00         ` John Stultz
  2016-11-16 20:30           ` Chris Metcalf
  0 siblings, 1 reply; 18+ messages in thread
From: John Stultz @ 2016-11-16 20:00 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren,
	Steven Miao, lkml

On Wed, Nov 16, 2016 at 11:56 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> On 11/16/2016 2:45 PM, John Stultz wrote:
>>
>> On Wed, Nov 16, 2016 at 11:30 AM, Chris Metcalf <cmetcalf@mellanox.com>
>> wrote:
>>>
>>> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
>>> index 08398182f56e..5444429884b8 100644
>>> --- a/include/linux/clocksource.h
>>> +++ b/include/linux/clocksource.h
>>> @@ -171,6 +171,10 @@ static inline u32 clocksource_hz2mult(u32 hz, u32
>>> shift_constant)
>>>    *
>>>    * Converts cycles to nanoseconds, using the given mult and shift.
>>>    *
>>> + * The code is optimized for performance and not intended to work
>>> + * with absolute clocksource cycles, as it will easily overflow,
>>> + * but just intended for relative (delta) clocksource cycles.
>>> + *
>>>    * XXX - This could use some mult_lxl_ll() asm optimization
>>
>> Just as a heads up, it seems your working against an older kernel, as
>> this didn't apply. Its simple enough to fix up, so I'll do so, but in
>> the future, please submit patches against something close to Linus
>> HEAD.
>
>
> Oops, sorry; it wasn't version skew (I'm at v4.9-rc4) but whitespace damage.
> I assumed if I just pasted the patch into Thunderbird it would work, since
> it had
> no tabs.  But bizarrely, if I look at the patch in the mailer, it shows a
> two-space
> prefix, but when I save the email to a file, it has a three-space prefix.
> WTF?

Yea. Not many mailers can be trusted with sending patches. I'd
recommend git-send-email. :)

thanks
-john

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count
  2016-11-16 19:59       ` John Stultz
@ 2016-11-16 20:16         ` Chris Metcalf
  2016-11-16 20:29           ` John Stultz
  2016-11-17  9:53           ` Peter Zijlstra
  0 siblings, 2 replies; 18+ messages in thread
From: Chris Metcalf @ 2016-11-16 20:16 UTC (permalink / raw)
  To: John Stultz
  Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren,
	Steven Miao, lkml, Peter Zijlstra

On 11/16/2016 2:59 PM, John Stultz wrote:
> On Wed, Nov 16, 2016 at 11:35 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>> For large values of "mult" and long uptimes, the intermediate
>> result of "cycles * mult" can overflow 64 bits.  For example,
>> the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock;
>> we have mult = 853, and after 208.5 days, we overflow 64 bits.
>>
>> Since clocksource_cyc2ns() is intended to be used for relative
>> cycle counts, not absolute cycle counts, performance is more
>> importance than accepting a wider range of cycle values.
>> So, just use mult_frac() directly in tile's sched_clock().
>>
>> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
>> ---
>> Blackfin should make a similar change in their sched_clock().
>>
>>   arch/tile/kernel/time.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
>> index 178989e6d3e3..ea960d660917 100644
>> --- a/arch/tile/kernel/time.c
>> +++ b/arch/tile/kernel/time.c
>> @@ -218,8 +218,8 @@ void do_timer_interrupt(struct pt_regs *regs, int fault_num)
>>    */
>>   unsigned long long sched_clock(void)
>>   {
>> -       return clocksource_cyc2ns(get_cycles(),
>> -                                 sched_clock_mult, SCHED_CLOCK_SHIFT);
>> +       return mult_frac(get_cycles(),
>> +                        sched_clock_mult, 1ULL << SCHED_CLOCK_SHIFT);
>>   }
> So... looking closer at mult_frac(), its a really slow implementation,
> doing 2 divs and a mod and a mult. Hopefully the compiler can sort out
> the divs are power of two, and optimize it out, but I'm still
> hesitant.
>
> sched_clock() is normally a very hot-path call, so this might have a
> real performance impact, especially compared to what its replacing.
>
> In your earlier patch, you mentioned this was similar to 4cecf6d401a0
> ("sched, x86: Avoid unnecessary overflow in
> sched_clock"). It might be better to actually try to use similar logic
> there, to make sure the performance impact is minimal.

This was the first thing I looked at when I saw the mult_frac()
implementation.  The modulus operations are indeed converted to
bitmasks and the divides to shifts.  We do have to do two multiplies
instead of one, but that's basically the worst of the cost.

Change 4cecf6d401a0 results in essentially identical code for x86 as
this proposed change does for tile.  In fact a follow-on change by
Salman introduced mult_frac() and switched to using it, so it was
identical at that point.

PeterZ (cc'ed) then improved it to use __int128 math via
mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply
instead of two, but the multiply is handled by an out-of-line call to
__multi3, and the sched_clock() function ends up about 2.5x slower as
a result.

Thanks for thinking about this!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count
  2016-11-16 20:16         ` Chris Metcalf
@ 2016-11-16 20:29           ` John Stultz
  2016-11-16 20:31             ` John Stultz
  2016-11-17  9:53           ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: John Stultz @ 2016-11-16 20:29 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren,
	Steven Miao, lkml, Peter Zijlstra

On Wed, Nov 16, 2016 at 12:16 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> On 11/16/2016 2:59 PM, John Stultz wrote:
>>
>> In your earlier patch, you mentioned this was similar to 4cecf6d401a0
>> ("sched, x86: Avoid unnecessary overflow in
>> sched_clock"). It might be better to actually try to use similar logic
>> there, to make sure the performance impact is minimal.
>
>
> This was the first thing I looked at when I saw the mult_frac()
> implementation.  The modulus operations are indeed converted to
> bitmasks and the divides to shifts.  We do have to do two multiplies
> instead of one, but that's basically the worst of the cost.
>
> Change 4cecf6d401a0 results in essentially identical code for x86 as
> this proposed change does for tile.  In fact a follow-on change by
> Salman introduced mult_frac() and switched to using it, so it was
> identical at that point.
>
> PeterZ (cc'ed) then improved it to use __int128 math via
> mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply
> instead of two, but the multiply is handled by an out-of-line call to
> __multi3, and the sched_clock() function ends up about 2.5x slower as
> a result.
>
> Thanks for thinking about this!

Heh. Thanks for the history lesson and apologies for my forgetfulness. :)

thanks
-john

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits
  2016-11-16 20:00         ` John Stultz
@ 2016-11-16 20:30           ` Chris Metcalf
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Metcalf @ 2016-11-16 20:30 UTC (permalink / raw)
  To: John Stultz
  Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren,
	Steven Miao, lkml

On 11/16/2016 3:00 PM, John Stultz wrote:
> On Wed, Nov 16, 2016 at 11:56 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>> On 11/16/2016 2:45 PM, John Stultz wrote:
>>> On Wed, Nov 16, 2016 at 11:30 AM, Chris Metcalf <cmetcalf@mellanox.com>
>>> wrote:
>>>> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
>>>> index 08398182f56e..5444429884b8 100644
>>>> --- a/include/linux/clocksource.h
>>>> +++ b/include/linux/clocksource.h
>>>> @@ -171,6 +171,10 @@ static inline u32 clocksource_hz2mult(u32 hz, u32
>>>> shift_constant)
>>>>     *
>>>>     * Converts cycles to nanoseconds, using the given mult and shift.
>>>>     *
>>>> + * The code is optimized for performance and not intended to work
>>>> + * with absolute clocksource cycles, as it will easily overflow,
>>>> + * but just intended for relative (delta) clocksource cycles.
>>>> + *
>>>>     * XXX - This could use some mult_lxl_ll() asm optimization
>>> Just as a heads up, it seems your working against an older kernel, as
>>> this didn't apply. Its simple enough to fix up, so I'll do so, but in
>>> the future, please submit patches against something close to Linus
>>> HEAD.
>>
>> Oops, sorry; it wasn't version skew (I'm at v4.9-rc4) but whitespace damage.
>> I assumed if I just pasted the patch into Thunderbird it would work, since
>> it had
>> no tabs.  But bizarrely, if I look at the patch in the mailer, it shows a
>> two-space
>> prefix, but when I save the email to a file, it has a three-space prefix.
>> WTF?
> Yea. Not many mailers can be trusted with sending patches. I'd
> recommend git-send-email. :)

Yes, indeed.  That is what I always do normally.  But since I was
already part way through an email written interactively in my mailer,
I figured I'd be lazy and paste it in.  Lesson learned...

That said, it is annoyingly difficult to pick up an in-progress email
and convert it into something you can edit in, say, Emacs.  You end up
having to cut and paste the "To:" and "Cc:" and "Subject:" information
out of your mailer's helpful GUI and into Emacs in the standard RFC822
way that git send-email understands.  I'm not sure there's a better
way, but at least I feel better now that I've complained about it :-)

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count
  2016-11-16 20:29           ` John Stultz
@ 2016-11-16 20:31             ` John Stultz
  0 siblings, 0 replies; 18+ messages in thread
From: John Stultz @ 2016-11-16 20:31 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren,
	Steven Miao, lkml, Peter Zijlstra

On Wed, Nov 16, 2016 at 12:29 PM, John Stultz <john.stultz@linaro.org> wrote:
> On Wed, Nov 16, 2016 at 12:16 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>> Change 4cecf6d401a0 results in essentially identical code for x86 as
>> this proposed change does for tile.  In fact a follow-on change by
>> Salman introduced mult_frac() and switched to using it, so it was
>> identical at that point.
>>
>> PeterZ (cc'ed) then improved it to use __int128 math via
>> mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply
>> instead of two, but the multiply is handled by an out-of-line call to
>> __multi3, and the sched_clock() function ends up about 2.5x slower as
>> a result.
>>
>> Thanks for thinking about this!
>
> Heh. Thanks for the history lesson and apologies for my forgetfulness. :)

Oh.. and some of these details might be useful to have in the commit message!

thanks
-john

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count
  2016-11-16 20:16         ` Chris Metcalf
  2016-11-16 20:29           ` John Stultz
@ 2016-11-17  9:53           ` Peter Zijlstra
  2016-11-17 20:00             ` Chris Metcalf
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-11-17  9:53 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner,
	Tony Lindgren, Steven Miao, lkml

On Wed, Nov 16, 2016 at 03:16:59PM -0500, Chris Metcalf wrote:
> PeterZ (cc'ed) then improved it to use __int128 math via
> mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply
> instead of two, but the multiply is handled by an out-of-line call to
> __multi3, and the sched_clock() function ends up about 2.5x slower as
> a result.

Well, only if you set CONFIG_ARCH_SUPPORTS_INT128, otherwise it reduces
to 2 32x23->64 multiplications, of which one if conditional on there
actually being bits set in the high word of the u64 argument.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count
  2016-11-17  9:53           ` Peter Zijlstra
@ 2016-11-17 20:00             ` Chris Metcalf
  2016-11-18 10:34               ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Metcalf @ 2016-11-17 20:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner,
	Tony Lindgren, Steven Miao, lkml

On 11/17/2016 4:53 AM, Peter Zijlstra wrote:
> On Wed, Nov 16, 2016 at 03:16:59PM -0500, Chris Metcalf wrote:
>> PeterZ (cc'ed) then improved it to use __int128 math via
>> mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply
>> instead of two, but the multiply is handled by an out-of-line call to
>> __multi3, and the sched_clock() function ends up about 2.5x slower as
>> a result.
> Well, only if you set CONFIG_ARCH_SUPPORTS_INT128, otherwise it reduces
> to 2 32x23->64 multiplications, of which one if conditional on there
> actually being bits set in the high word of the u64 argument.

I didn't notice that.  It took me down an interesting rathole.

Obviously the branch optimization won't help on cycle counter values,
since we blow out of the low 32 bits in the first few seconds of
uptime.  So the conditional test won't help, but the 32x32
multiply optimizations should.

However, I was surprised to discover that the compiler doesn't always
catch the 32x32 case.  It does for simple cases on gcc 4.4, but if
you change the compiler version or the complexity of the code, it can
lose sight of the optimization opportunity, and in fact that happens
in mul_u64_u32_shr(), and we get 64x64 multiplies.  I passed this
along to our compiler team as an optimization bug.

Given that, it turns out it's always faster to do the unconditional
path on tilegx.  The basic machine instruction is a 32x32
multiply-accumulate, but unlike most tilegx instructions, it causes a
1-cycle RAW hazard stall if you try to use the result in the next
instruction.  Similarly, mispredicted branches take a 1-cycle stall.
The unconditional code pipelines the multiplies completely and runs in
10 cycles; the conditional code has two RAW hazard stalls and a branch
stall, so it takes 12 cycles even when it skips the second multiply.

Working around the missed compiler optimization by taking the existing
mul_u64_u32_shr() and replacing "*" with calls to __insn_mul_lu_lu()
to use the compiler builtin gives a 10-cycle function (assuming we
have to do both multiplies).  So this is the same performance as the
pipelined mult_frac() that does the overlapping 64x64 multiplies.

We can do better by providing a completely hand-rolled version of the
function, either using "*" if the compiler optimization is fixed, or
__insn_mul_lu_lu() if it isn't, that doesn't do a conditional branch:

static inline u64 mul_u64_u32_shr(u64 a, u64 mul, unsigned int shift)
{
     return (__insn_mul_lu_lu(a, mul) >> shift) +
         (__insn_mul_lu_lu(a >> 32, mul) << (32 - shift));
}

This compiles down to 5 cycles with no hazard stalls.  It's not
completely clear where I'd put this to override the <linux/math64.h>
version; presumably in <asm/div64.h>?  Of course I'd then also have to
make it conditional on __tilegx__, since tilepro has a different set
of multiply instructions, as it's an ILP32 ISA.

I'm a little dubious that it's worth the investment in build
scaffolding to do this to save 5 cycles, so I think for now I will
just keep the mult_frac() version, and push it to stable to fix the
bug with the cycle counter overflowing.  Depending on
what/when I hear back from the compiler team, I will think about
saving those few extra cycles with a custom mul_u64_u32_shr().

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count
  2016-11-17 20:00             ` Chris Metcalf
@ 2016-11-18 10:34               ` Peter Zijlstra
  2016-11-18 14:24                 ` Chris Metcalf
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-11-18 10:34 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner,
	Tony Lindgren, Steven Miao, lkml

On Thu, Nov 17, 2016 at 03:00:14PM -0500, Chris Metcalf wrote:
> On 11/17/2016 4:53 AM, Peter Zijlstra wrote:
> >On Wed, Nov 16, 2016 at 03:16:59PM -0500, Chris Metcalf wrote:
> >>PeterZ (cc'ed) then improved it to use __int128 math via
> >>mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply
> >>instead of two, but the multiply is handled by an out-of-line call to
> >>__multi3, and the sched_clock() function ends up about 2.5x slower as
> >>a result.
> >Well, only if you set CONFIG_ARCH_SUPPORTS_INT128, otherwise it reduces
> >to 2 32x23->64 multiplications, of which one if conditional on there
> >actually being bits set in the high word of the u64 argument.
> 
> I didn't notice that.  It took me down an interesting rathole.
> 
> Obviously the branch optimization won't help on cycle counter values,
> since we blow out of the low 32 bits in the first few seconds of
> uptime.  So the conditional test won't help, but the 32x32
> multiply optimizations should.

Now, I don't quite remember things, but isn't it the idea to convert
cycle deltas and accumulate in ns? That way you most always convert
small values.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count
  2016-11-18 10:34               ` Peter Zijlstra
@ 2016-11-18 14:24                 ` Chris Metcalf
  2016-11-18 14:50                   ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Metcalf @ 2016-11-18 14:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner,
	Tony Lindgren, Steven Miao, lkml

On 11/18/2016 5:34 AM, Peter Zijlstra wrote:
> On Thu, Nov 17, 2016 at 03:00:14PM -0500, Chris Metcalf wrote:
>> On 11/17/2016 4:53 AM, Peter Zijlstra wrote:
>>> On Wed, Nov 16, 2016 at 03:16:59PM -0500, Chris Metcalf wrote:
>>>> PeterZ (cc'ed) then improved it to use __int128 math via
>>>> mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply
>>>> instead of two, but the multiply is handled by an out-of-line call to
>>>> __multi3, and the sched_clock() function ends up about 2.5x slower as
>>>> a result.
>>> Well, only if you set CONFIG_ARCH_SUPPORTS_INT128, otherwise it reduces
>>> to 2 32x23->64 multiplications, of which one if conditional on there
>>> actually being bits set in the high word of the u64 argument.
>> I didn't notice that.  It took me down an interesting rathole.
>>
>> Obviously the branch optimization won't help on cycle counter values,
>> since we blow out of the low 32 bits in the first few seconds of
>> uptime.  So the conditional test won't help, but the 32x32
>> multiply optimizations should.
> Now, I don't quite remember things, but isn't it the idea to convert
> cycle deltas and accumulate in ns? That way you most always convert
> small values.

I would think you would also unnecessarily accumulate small errors.

The x86 sched_clock() seems to purely scale the current TSC value,
so what tile is doing is consistent with that, at least.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count
  2016-11-18 14:24                 ` Chris Metcalf
@ 2016-11-18 14:50                   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-11-18 14:50 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner,
	Tony Lindgren, Steven Miao, lkml

On Fri, Nov 18, 2016 at 09:24:52AM -0500, Chris Metcalf wrote:
> I would think you would also unnecessarily accumulate small errors.

True..

> The x86 sched_clock() seems to purely scale the current TSC value,
> so what tile is doing is consistent with that, at least.

Right, this comes apart the moment TSC goes faster than 1GHz though.
Which might actually be the case, because then the mult-and-shift
reduces resolution and we'd wrap before the 64bit are done.

That would be something I ought to look at some time..

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-11-18 16:59 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-16 16:57 [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits Chris Metcalf
2016-11-16 18:04 ` John Stultz
2016-11-16 19:30   ` Chris Metcalf
2016-11-16 19:35     ` [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count Chris Metcalf
2016-11-16 19:59       ` John Stultz
2016-11-16 20:16         ` Chris Metcalf
2016-11-16 20:29           ` John Stultz
2016-11-16 20:31             ` John Stultz
2016-11-17  9:53           ` Peter Zijlstra
2016-11-17 20:00             ` Chris Metcalf
2016-11-18 10:34               ` Peter Zijlstra
2016-11-18 14:24                 ` Chris Metcalf
2016-11-18 14:50                   ` Peter Zijlstra
2016-11-16 19:40     ` [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits John Stultz
2016-11-16 19:45     ` John Stultz
2016-11-16 19:56       ` Chris Metcalf
2016-11-16 20:00         ` John Stultz
2016-11-16 20:30           ` Chris Metcalf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).