* [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits @ 2016-11-16 16:57 Chris Metcalf 2016-11-16 18:04 ` John Stultz 0 siblings, 1 reply; 18+ messages in thread From: Chris Metcalf @ 2016-11-16 16:57 UTC (permalink / raw) To: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, linux-kernel Cc: Chris Metcalf For large values of "mult" and long uptimes, the intermediate result of "cycles * mult" can overflow 64 bits. For example, the tile platform uses this helper function; for a 1.2 GHz clock, we have mult = 853, and after 208.5 days, we overflow 64 bits. The fix is basically the same as the fix for arch/x86 __cycles_2_ns() in commit 4cecf6d401a0 ("sched, x86: Avoid unnecessary overflow in sched_clock"), using the new mult_frac() helper. In addition to tile, arm/plat-omap and blackfin also use this helper function, so will presumably hit similar issues. Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> --- By the way, this is the bug that I was looking for when I tripped over the missing bugfix for timekeeping_delta_to_ns() a couple of days ago :-) include/linux/clocksource.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 08398182f56e..b2a022acf232 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -175,7 +175,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant) */ static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift) { - return ((u64) cycles * mult) >> shift; + return mult_frac(cycles, mult, 1ULL << shift); } -- 2.7.2 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits 2016-11-16 16:57 [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits Chris Metcalf @ 2016-11-16 18:04 ` John Stultz 2016-11-16 19:30 ` Chris Metcalf 0 siblings, 1 reply; 18+ messages in thread From: John Stultz @ 2016-11-16 18:04 UTC (permalink / raw) To: Chris Metcalf Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml On Wed, Nov 16, 2016 at 8:57 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote: > For large values of "mult" and long uptimes, the intermediate > result of "cycles * mult" can overflow 64 bits. For example, > the tile platform uses this helper function; for a 1.2 GHz clock, > we have mult = 853, and after 208.5 days, we overflow 64 bits. > > The fix is basically the same as the fix for arch/x86 __cycles_2_ns() > in commit 4cecf6d401a0 ("sched, x86: Avoid unnecessary overflow in > sched_clock"), using the new mult_frac() helper. > > In addition to tile, arm/plat-omap and blackfin also use this helper > function, so will presumably hit similar issues. > > Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> > --- > By the way, this is the bug that I was looking for when I tripped over > the missing bugfix for timekeeping_delta_to_ns() a couple of days ago :-) Glad you found your bug! :) > include/linux/clocksource.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h > index 08398182f56e..b2a022acf232 100644 > --- a/include/linux/clocksource.h > +++ b/include/linux/clocksource.h > @@ -175,7 +175,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant) > */ > static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift) > { > - return ((u64) cycles * mult) >> shift; > + return mult_frac(cycles, mult, 1ULL << shift); > } So clocksource_cyc2ns() was never intended to be used with indefinitely large cycle values, and it looks like tile and blackfin are abusing the interface (the omap usage provide cycle deltas rather then just the current counter value). I'd suggest instead to move tile/blackfin to using the generic sched_clock implementation that most of the architectures use, or special case the code in the arch specific sched_clock implementations(as x86 does) instead of modifying the common interface to better handle a use case its not intended for. thanks -john ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits 2016-11-16 18:04 ` John Stultz @ 2016-11-16 19:30 ` Chris Metcalf 2016-11-16 19:35 ` [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count Chris Metcalf ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Chris Metcalf @ 2016-11-16 19:30 UTC (permalink / raw) To: John Stultz Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml On 11/16/2016 1:04 PM, John Stultz wrote: > On Wed, Nov 16, 2016 at 8:57 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote: >> include/linux/clocksource.h | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h >> index 08398182f56e..b2a022acf232 100644 >> --- a/include/linux/clocksource.h >> +++ b/include/linux/clocksource.h >> @@ -175,7 +175,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant) >> */ >> static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift) >> { >> - return ((u64) cycles * mult) >> shift; >> + return mult_frac(cycles, mult, 1ULL << shift); >> } > > So clocksource_cyc2ns() was never intended to be used with > indefinitely large cycle values, and it looks like tile and blackfin > are abusing the interface (the omap usage provide cycle deltas rather > then just the current counter value). Well, the interface does just say "convert clocksource cycles to nanoseconds". :-) If you think it's more important that it be a little faster, we should adjust the documentation to say it is only appropriate for delta-cycles, not absolute cycles. I've appended a commit that does this if you'd like to take it. > I'd suggest instead to move tile/blackfin to using the generic > sched_clock implementation that most of the architectures use, or > special case the code in the arch specific sched_clock > implementations(as x86 does) instead of modifying the common interface > to better handle a use case its not intended for. Yes, since tile has a full 64-bit cycle counter, the best thing is to just directly open-code the mult_frac() in tile's sched_clock(). I'll push that change. Steven Miao, I assume you should do the same for blackfin. Thanks! -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com From: Chris Metcalf <cmetcalf@mellanox.com> Date: Wed, 16 Nov 2016 14:24:53 -0500 Subject: [PATCH] clocksource_cyc2ns: document intended range limitation The "cycles" argument should not be an absolute clocksource cycle value, as the implementation's arithmetic will overflow relatively easily. For performance, the implementation is simple and fast, since the function is intended for only relatively small delta values of clocksource cycles. Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> --- include/linux/clocksource.h | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 08398182f56e..5444429884b8 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -171,6 +171,10 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant) * * Converts cycles to nanoseconds, using the given mult and shift. * + * The code is optimized for performance and not intended to work + * with absolute clocksource cycles, as it will easily overflow, + * but just intended for relative (delta) clocksource cycles. + * * XXX - This could use some mult_lxl_ll() asm optimization */ static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift) -- 2.7.2 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count 2016-11-16 19:30 ` Chris Metcalf @ 2016-11-16 19:35 ` Chris Metcalf 2016-11-16 19:59 ` John Stultz 2016-11-16 19:40 ` [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits John Stultz 2016-11-16 19:45 ` John Stultz 2 siblings, 1 reply; 18+ messages in thread From: Chris Metcalf @ 2016-11-16 19:35 UTC (permalink / raw) To: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, linux-kernel Cc: Chris Metcalf For large values of "mult" and long uptimes, the intermediate result of "cycles * mult" can overflow 64 bits. For example, the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock; we have mult = 853, and after 208.5 days, we overflow 64 bits. Since clocksource_cyc2ns() is intended to be used for relative cycle counts, not absolute cycle counts, performance is more importance than accepting a wider range of cycle values. So, just use mult_frac() directly in tile's sched_clock(). Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> --- Blackfin should make a similar change in their sched_clock(). arch/tile/kernel/time.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c index 178989e6d3e3..ea960d660917 100644 --- a/arch/tile/kernel/time.c +++ b/arch/tile/kernel/time.c @@ -218,8 +218,8 @@ void do_timer_interrupt(struct pt_regs *regs, int fault_num) */ unsigned long long sched_clock(void) { - return clocksource_cyc2ns(get_cycles(), - sched_clock_mult, SCHED_CLOCK_SHIFT); + return mult_frac(get_cycles(), + sched_clock_mult, 1ULL << SCHED_CLOCK_SHIFT); } int setup_profiling_timer(unsigned int multiplier) -- 2.7.2 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count 2016-11-16 19:35 ` [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count Chris Metcalf @ 2016-11-16 19:59 ` John Stultz 2016-11-16 20:16 ` Chris Metcalf 0 siblings, 1 reply; 18+ messages in thread From: John Stultz @ 2016-11-16 19:59 UTC (permalink / raw) To: Chris Metcalf Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml On Wed, Nov 16, 2016 at 11:35 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote: > For large values of "mult" and long uptimes, the intermediate > result of "cycles * mult" can overflow 64 bits. For example, > the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock; > we have mult = 853, and after 208.5 days, we overflow 64 bits. > > Since clocksource_cyc2ns() is intended to be used for relative > cycle counts, not absolute cycle counts, performance is more > importance than accepting a wider range of cycle values. > So, just use mult_frac() directly in tile's sched_clock(). > > Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> > --- > Blackfin should make a similar change in their sched_clock(). > > arch/tile/kernel/time.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c > index 178989e6d3e3..ea960d660917 100644 > --- a/arch/tile/kernel/time.c > +++ b/arch/tile/kernel/time.c > @@ -218,8 +218,8 @@ void do_timer_interrupt(struct pt_regs *regs, int fault_num) > */ > unsigned long long sched_clock(void) > { > - return clocksource_cyc2ns(get_cycles(), > - sched_clock_mult, SCHED_CLOCK_SHIFT); > + return mult_frac(get_cycles(), > + sched_clock_mult, 1ULL << SCHED_CLOCK_SHIFT); > } So... looking closer at mult_frac(), its a really slow implementation, doing 2 divs and a mod and a mult. Hopefully the compiler can sort out the divs are power of two, and optimize it out, but I'm still hesitant. sched_clock() is normally a very hot-path call, so this might have a real performance impact, especially compared to what its replacing. In your earlier patch, you mentioned this was similar to 4cecf6d401a0 ("sched, x86: Avoid unnecessary overflow in sched_clock"). It might be better to actually try to use similar logic there, to make sure the performance impact is minimal. thanks -john ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count 2016-11-16 19:59 ` John Stultz @ 2016-11-16 20:16 ` Chris Metcalf 2016-11-16 20:29 ` John Stultz 2016-11-17 9:53 ` Peter Zijlstra 0 siblings, 2 replies; 18+ messages in thread From: Chris Metcalf @ 2016-11-16 20:16 UTC (permalink / raw) To: John Stultz Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml, Peter Zijlstra On 11/16/2016 2:59 PM, John Stultz wrote: > On Wed, Nov 16, 2016 at 11:35 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote: >> For large values of "mult" and long uptimes, the intermediate >> result of "cycles * mult" can overflow 64 bits. For example, >> the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock; >> we have mult = 853, and after 208.5 days, we overflow 64 bits. >> >> Since clocksource_cyc2ns() is intended to be used for relative >> cycle counts, not absolute cycle counts, performance is more >> importance than accepting a wider range of cycle values. >> So, just use mult_frac() directly in tile's sched_clock(). >> >> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> >> --- >> Blackfin should make a similar change in their sched_clock(). >> >> arch/tile/kernel/time.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c >> index 178989e6d3e3..ea960d660917 100644 >> --- a/arch/tile/kernel/time.c >> +++ b/arch/tile/kernel/time.c >> @@ -218,8 +218,8 @@ void do_timer_interrupt(struct pt_regs *regs, int fault_num) >> */ >> unsigned long long sched_clock(void) >> { >> - return clocksource_cyc2ns(get_cycles(), >> - sched_clock_mult, SCHED_CLOCK_SHIFT); >> + return mult_frac(get_cycles(), >> + sched_clock_mult, 1ULL << SCHED_CLOCK_SHIFT); >> } > So... looking closer at mult_frac(), its a really slow implementation, > doing 2 divs and a mod and a mult. Hopefully the compiler can sort out > the divs are power of two, and optimize it out, but I'm still > hesitant. > > sched_clock() is normally a very hot-path call, so this might have a > real performance impact, especially compared to what its replacing. > > In your earlier patch, you mentioned this was similar to 4cecf6d401a0 > ("sched, x86: Avoid unnecessary overflow in > sched_clock"). It might be better to actually try to use similar logic > there, to make sure the performance impact is minimal. This was the first thing I looked at when I saw the mult_frac() implementation. The modulus operations are indeed converted to bitmasks and the divides to shifts. We do have to do two multiplies instead of one, but that's basically the worst of the cost. Change 4cecf6d401a0 results in essentially identical code for x86 as this proposed change does for tile. In fact a follow-on change by Salman introduced mult_frac() and switched to using it, so it was identical at that point. PeterZ (cc'ed) then improved it to use __int128 math via mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply instead of two, but the multiply is handled by an out-of-line call to __multi3, and the sched_clock() function ends up about 2.5x slower as a result. Thanks for thinking about this! -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count 2016-11-16 20:16 ` Chris Metcalf @ 2016-11-16 20:29 ` John Stultz 2016-11-16 20:31 ` John Stultz 2016-11-17 9:53 ` Peter Zijlstra 1 sibling, 1 reply; 18+ messages in thread From: John Stultz @ 2016-11-16 20:29 UTC (permalink / raw) To: Chris Metcalf Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml, Peter Zijlstra On Wed, Nov 16, 2016 at 12:16 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote: > On 11/16/2016 2:59 PM, John Stultz wrote: >> >> In your earlier patch, you mentioned this was similar to 4cecf6d401a0 >> ("sched, x86: Avoid unnecessary overflow in >> sched_clock"). It might be better to actually try to use similar logic >> there, to make sure the performance impact is minimal. > > > This was the first thing I looked at when I saw the mult_frac() > implementation. The modulus operations are indeed converted to > bitmasks and the divides to shifts. We do have to do two multiplies > instead of one, but that's basically the worst of the cost. > > Change 4cecf6d401a0 results in essentially identical code for x86 as > this proposed change does for tile. In fact a follow-on change by > Salman introduced mult_frac() and switched to using it, so it was > identical at that point. > > PeterZ (cc'ed) then improved it to use __int128 math via > mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply > instead of two, but the multiply is handled by an out-of-line call to > __multi3, and the sched_clock() function ends up about 2.5x slower as > a result. > > Thanks for thinking about this! Heh. Thanks for the history lesson and apologies for my forgetfulness. :) thanks -john ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count 2016-11-16 20:29 ` John Stultz @ 2016-11-16 20:31 ` John Stultz 0 siblings, 0 replies; 18+ messages in thread From: John Stultz @ 2016-11-16 20:31 UTC (permalink / raw) To: Chris Metcalf Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml, Peter Zijlstra On Wed, Nov 16, 2016 at 12:29 PM, John Stultz <john.stultz@linaro.org> wrote: > On Wed, Nov 16, 2016 at 12:16 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote: >> Change 4cecf6d401a0 results in essentially identical code for x86 as >> this proposed change does for tile. In fact a follow-on change by >> Salman introduced mult_frac() and switched to using it, so it was >> identical at that point. >> >> PeterZ (cc'ed) then improved it to use __int128 math via >> mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply >> instead of two, but the multiply is handled by an out-of-line call to >> __multi3, and the sched_clock() function ends up about 2.5x slower as >> a result. >> >> Thanks for thinking about this! > > Heh. Thanks for the history lesson and apologies for my forgetfulness. :) Oh.. and some of these details might be useful to have in the commit message! thanks -john ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count 2016-11-16 20:16 ` Chris Metcalf 2016-11-16 20:29 ` John Stultz @ 2016-11-17 9:53 ` Peter Zijlstra 2016-11-17 20:00 ` Chris Metcalf 1 sibling, 1 reply; 18+ messages in thread From: Peter Zijlstra @ 2016-11-17 9:53 UTC (permalink / raw) To: Chris Metcalf Cc: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml On Wed, Nov 16, 2016 at 03:16:59PM -0500, Chris Metcalf wrote: > PeterZ (cc'ed) then improved it to use __int128 math via > mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply > instead of two, but the multiply is handled by an out-of-line call to > __multi3, and the sched_clock() function ends up about 2.5x slower as > a result. Well, only if you set CONFIG_ARCH_SUPPORTS_INT128, otherwise it reduces to 2 32x23->64 multiplications, of which one if conditional on there actually being bits set in the high word of the u64 argument. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count 2016-11-17 9:53 ` Peter Zijlstra @ 2016-11-17 20:00 ` Chris Metcalf 2016-11-18 10:34 ` Peter Zijlstra 0 siblings, 1 reply; 18+ messages in thread From: Chris Metcalf @ 2016-11-17 20:00 UTC (permalink / raw) To: Peter Zijlstra Cc: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml On 11/17/2016 4:53 AM, Peter Zijlstra wrote: > On Wed, Nov 16, 2016 at 03:16:59PM -0500, Chris Metcalf wrote: >> PeterZ (cc'ed) then improved it to use __int128 math via >> mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply >> instead of two, but the multiply is handled by an out-of-line call to >> __multi3, and the sched_clock() function ends up about 2.5x slower as >> a result. > Well, only if you set CONFIG_ARCH_SUPPORTS_INT128, otherwise it reduces > to 2 32x23->64 multiplications, of which one if conditional on there > actually being bits set in the high word of the u64 argument. I didn't notice that. It took me down an interesting rathole. Obviously the branch optimization won't help on cycle counter values, since we blow out of the low 32 bits in the first few seconds of uptime. So the conditional test won't help, but the 32x32 multiply optimizations should. However, I was surprised to discover that the compiler doesn't always catch the 32x32 case. It does for simple cases on gcc 4.4, but if you change the compiler version or the complexity of the code, it can lose sight of the optimization opportunity, and in fact that happens in mul_u64_u32_shr(), and we get 64x64 multiplies. I passed this along to our compiler team as an optimization bug. Given that, it turns out it's always faster to do the unconditional path on tilegx. The basic machine instruction is a 32x32 multiply-accumulate, but unlike most tilegx instructions, it causes a 1-cycle RAW hazard stall if you try to use the result in the next instruction. Similarly, mispredicted branches take a 1-cycle stall. The unconditional code pipelines the multiplies completely and runs in 10 cycles; the conditional code has two RAW hazard stalls and a branch stall, so it takes 12 cycles even when it skips the second multiply. Working around the missed compiler optimization by taking the existing mul_u64_u32_shr() and replacing "*" with calls to __insn_mul_lu_lu() to use the compiler builtin gives a 10-cycle function (assuming we have to do both multiplies). So this is the same performance as the pipelined mult_frac() that does the overlapping 64x64 multiplies. We can do better by providing a completely hand-rolled version of the function, either using "*" if the compiler optimization is fixed, or __insn_mul_lu_lu() if it isn't, that doesn't do a conditional branch: static inline u64 mul_u64_u32_shr(u64 a, u64 mul, unsigned int shift) { return (__insn_mul_lu_lu(a, mul) >> shift) + (__insn_mul_lu_lu(a >> 32, mul) << (32 - shift)); } This compiles down to 5 cycles with no hazard stalls. It's not completely clear where I'd put this to override the <linux/math64.h> version; presumably in <asm/div64.h>? Of course I'd then also have to make it conditional on __tilegx__, since tilepro has a different set of multiply instructions, as it's an ILP32 ISA. I'm a little dubious that it's worth the investment in build scaffolding to do this to save 5 cycles, so I think for now I will just keep the mult_frac() version, and push it to stable to fix the bug with the cycle counter overflowing. Depending on what/when I hear back from the compiler team, I will think about saving those few extra cycles with a custom mul_u64_u32_shr(). -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count 2016-11-17 20:00 ` Chris Metcalf @ 2016-11-18 10:34 ` Peter Zijlstra 2016-11-18 14:24 ` Chris Metcalf 0 siblings, 1 reply; 18+ messages in thread From: Peter Zijlstra @ 2016-11-18 10:34 UTC (permalink / raw) To: Chris Metcalf Cc: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml On Thu, Nov 17, 2016 at 03:00:14PM -0500, Chris Metcalf wrote: > On 11/17/2016 4:53 AM, Peter Zijlstra wrote: > >On Wed, Nov 16, 2016 at 03:16:59PM -0500, Chris Metcalf wrote: > >>PeterZ (cc'ed) then improved it to use __int128 math via > >>mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply > >>instead of two, but the multiply is handled by an out-of-line call to > >>__multi3, and the sched_clock() function ends up about 2.5x slower as > >>a result. > >Well, only if you set CONFIG_ARCH_SUPPORTS_INT128, otherwise it reduces > >to 2 32x23->64 multiplications, of which one if conditional on there > >actually being bits set in the high word of the u64 argument. > > I didn't notice that. It took me down an interesting rathole. > > Obviously the branch optimization won't help on cycle counter values, > since we blow out of the low 32 bits in the first few seconds of > uptime. So the conditional test won't help, but the 32x32 > multiply optimizations should. Now, I don't quite remember things, but isn't it the idea to convert cycle deltas and accumulate in ns? That way you most always convert small values. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count 2016-11-18 10:34 ` Peter Zijlstra @ 2016-11-18 14:24 ` Chris Metcalf 2016-11-18 14:50 ` Peter Zijlstra 0 siblings, 1 reply; 18+ messages in thread From: Chris Metcalf @ 2016-11-18 14:24 UTC (permalink / raw) To: Peter Zijlstra Cc: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml On 11/18/2016 5:34 AM, Peter Zijlstra wrote: > On Thu, Nov 17, 2016 at 03:00:14PM -0500, Chris Metcalf wrote: >> On 11/17/2016 4:53 AM, Peter Zijlstra wrote: >>> On Wed, Nov 16, 2016 at 03:16:59PM -0500, Chris Metcalf wrote: >>>> PeterZ (cc'ed) then improved it to use __int128 math via >>>> mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply >>>> instead of two, but the multiply is handled by an out-of-line call to >>>> __multi3, and the sched_clock() function ends up about 2.5x slower as >>>> a result. >>> Well, only if you set CONFIG_ARCH_SUPPORTS_INT128, otherwise it reduces >>> to 2 32x23->64 multiplications, of which one if conditional on there >>> actually being bits set in the high word of the u64 argument. >> I didn't notice that. It took me down an interesting rathole. >> >> Obviously the branch optimization won't help on cycle counter values, >> since we blow out of the low 32 bits in the first few seconds of >> uptime. So the conditional test won't help, but the 32x32 >> multiply optimizations should. > Now, I don't quite remember things, but isn't it the idea to convert > cycle deltas and accumulate in ns? That way you most always convert > small values. I would think you would also unnecessarily accumulate small errors. The x86 sched_clock() seems to purely scale the current TSC value, so what tile is doing is consistent with that, at least. -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count 2016-11-18 14:24 ` Chris Metcalf @ 2016-11-18 14:50 ` Peter Zijlstra 0 siblings, 0 replies; 18+ messages in thread From: Peter Zijlstra @ 2016-11-18 14:50 UTC (permalink / raw) To: Chris Metcalf Cc: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml On Fri, Nov 18, 2016 at 09:24:52AM -0500, Chris Metcalf wrote: > I would think you would also unnecessarily accumulate small errors. True.. > The x86 sched_clock() seems to purely scale the current TSC value, > so what tile is doing is consistent with that, at least. Right, this comes apart the moment TSC goes faster than 1GHz though. Which might actually be the case, because then the mult-and-shift reduces resolution and we'd wrap before the 64bit are done. That would be something I ought to look at some time.. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits 2016-11-16 19:30 ` Chris Metcalf 2016-11-16 19:35 ` [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count Chris Metcalf @ 2016-11-16 19:40 ` John Stultz 2016-11-16 19:45 ` John Stultz 2 siblings, 0 replies; 18+ messages in thread From: John Stultz @ 2016-11-16 19:40 UTC (permalink / raw) To: Chris Metcalf Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml On Wed, Nov 16, 2016 at 11:30 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote: > On 11/16/2016 1:04 PM, John Stultz wrote: >> >> On Wed, Nov 16, 2016 at 8:57 AM, Chris Metcalf <cmetcalf@mellanox.com> >> wrote: >>> >>> include/linux/clocksource.h | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h >>> index 08398182f56e..b2a022acf232 100644 >>> --- a/include/linux/clocksource.h >>> +++ b/include/linux/clocksource.h >>> @@ -175,7 +175,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 >>> shift_constant) >>> */ >>> static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 >>> shift) >>> { >>> - return ((u64) cycles * mult) >> shift; >>> + return mult_frac(cycles, mult, 1ULL << shift); >>> } >> >> >> So clocksource_cyc2ns() was never intended to be used with >> indefinitely large cycle values, and it looks like tile and blackfin >> are abusing the interface (the omap usage provide cycle deltas rather >> then just the current counter value). > > > Well, the interface does just say "convert clocksource cycles to > nanoseconds". :-) Right, and I can understand the confusion, but its not being used with a struct clocksource. Its just being used to convert get_cycles(). > If you think it's more important that it be a little faster, we should > adjust the > documentation to say it is only appropriate for delta-cycles, not absolute > cycles. > I've appended a commit that does this if you'd like to take it. That's fair. Thanks for sending that, II'll queue that in my tree here in a moment. >> I'd suggest instead to move tile/blackfin to using the generic >> sched_clock implementation that most of the architectures use, or >> special case the code in the arch specific sched_clock >> implementations(as x86 does) instead of modifying the common interface >> to better handle a use case its not intended for. > > > Yes, since tile has a full 64-bit cycle counter, the best thing is to just > directly > open-code the mult_frac() in tile's sched_clock(). I'll push that change. > Steven Miao, I assume you should do the same for blackfin. thanks -john ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits 2016-11-16 19:30 ` Chris Metcalf 2016-11-16 19:35 ` [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count Chris Metcalf 2016-11-16 19:40 ` [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits John Stultz @ 2016-11-16 19:45 ` John Stultz 2016-11-16 19:56 ` Chris Metcalf 2 siblings, 1 reply; 18+ messages in thread From: John Stultz @ 2016-11-16 19:45 UTC (permalink / raw) To: Chris Metcalf Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml On Wed, Nov 16, 2016 at 11:30 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote: > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h > index 08398182f56e..5444429884b8 100644 > --- a/include/linux/clocksource.h > +++ b/include/linux/clocksource.h > @@ -171,6 +171,10 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 > shift_constant) > * > * Converts cycles to nanoseconds, using the given mult and shift. > * > + * The code is optimized for performance and not intended to work > + * with absolute clocksource cycles, as it will easily overflow, > + * but just intended for relative (delta) clocksource cycles. > + * > * XXX - This could use some mult_lxl_ll() asm optimization Just as a heads up, it seems your working against an older kernel, as this didn't apply. Its simple enough to fix up, so I'll do so, but in the future, please submit patches against something close to Linus HEAD. thanks -john ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits 2016-11-16 19:45 ` John Stultz @ 2016-11-16 19:56 ` Chris Metcalf 2016-11-16 20:00 ` John Stultz 0 siblings, 1 reply; 18+ messages in thread From: Chris Metcalf @ 2016-11-16 19:56 UTC (permalink / raw) To: John Stultz Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml On 11/16/2016 2:45 PM, John Stultz wrote: > On Wed, Nov 16, 2016 at 11:30 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote: >> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h >> index 08398182f56e..5444429884b8 100644 >> --- a/include/linux/clocksource.h >> +++ b/include/linux/clocksource.h >> @@ -171,6 +171,10 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 >> shift_constant) >> * >> * Converts cycles to nanoseconds, using the given mult and shift. >> * >> + * The code is optimized for performance and not intended to work >> + * with absolute clocksource cycles, as it will easily overflow, >> + * but just intended for relative (delta) clocksource cycles. >> + * >> * XXX - This could use some mult_lxl_ll() asm optimization > Just as a heads up, it seems your working against an older kernel, as > this didn't apply. Its simple enough to fix up, so I'll do so, but in > the future, please submit patches against something close to Linus > HEAD. Oops, sorry; it wasn't version skew (I'm at v4.9-rc4) but whitespace damage. I assumed if I just pasted the patch into Thunderbird it would work, since it had no tabs. But bizarrely, if I look at the patch in the mailer, it shows a two-space prefix, but when I save the email to a file, it has a three-space prefix. WTF? -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits 2016-11-16 19:56 ` Chris Metcalf @ 2016-11-16 20:00 ` John Stultz 2016-11-16 20:30 ` Chris Metcalf 0 siblings, 1 reply; 18+ messages in thread From: John Stultz @ 2016-11-16 20:00 UTC (permalink / raw) To: Chris Metcalf Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml On Wed, Nov 16, 2016 at 11:56 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote: > On 11/16/2016 2:45 PM, John Stultz wrote: >> >> On Wed, Nov 16, 2016 at 11:30 AM, Chris Metcalf <cmetcalf@mellanox.com> >> wrote: >>> >>> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h >>> index 08398182f56e..5444429884b8 100644 >>> --- a/include/linux/clocksource.h >>> +++ b/include/linux/clocksource.h >>> @@ -171,6 +171,10 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 >>> shift_constant) >>> * >>> * Converts cycles to nanoseconds, using the given mult and shift. >>> * >>> + * The code is optimized for performance and not intended to work >>> + * with absolute clocksource cycles, as it will easily overflow, >>> + * but just intended for relative (delta) clocksource cycles. >>> + * >>> * XXX - This could use some mult_lxl_ll() asm optimization >> >> Just as a heads up, it seems your working against an older kernel, as >> this didn't apply. Its simple enough to fix up, so I'll do so, but in >> the future, please submit patches against something close to Linus >> HEAD. > > > Oops, sorry; it wasn't version skew (I'm at v4.9-rc4) but whitespace damage. > I assumed if I just pasted the patch into Thunderbird it would work, since > it had > no tabs. But bizarrely, if I look at the patch in the mailer, it shows a > two-space > prefix, but when I save the email to a file, it has a three-space prefix. > WTF? Yea. Not many mailers can be trusted with sending patches. I'd recommend git-send-email. :) thanks -john ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits 2016-11-16 20:00 ` John Stultz @ 2016-11-16 20:30 ` Chris Metcalf 0 siblings, 0 replies; 18+ messages in thread From: Chris Metcalf @ 2016-11-16 20:30 UTC (permalink / raw) To: John Stultz Cc: Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml On 11/16/2016 3:00 PM, John Stultz wrote: > On Wed, Nov 16, 2016 at 11:56 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote: >> On 11/16/2016 2:45 PM, John Stultz wrote: >>> On Wed, Nov 16, 2016 at 11:30 AM, Chris Metcalf <cmetcalf@mellanox.com> >>> wrote: >>>> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h >>>> index 08398182f56e..5444429884b8 100644 >>>> --- a/include/linux/clocksource.h >>>> +++ b/include/linux/clocksource.h >>>> @@ -171,6 +171,10 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 >>>> shift_constant) >>>> * >>>> * Converts cycles to nanoseconds, using the given mult and shift. >>>> * >>>> + * The code is optimized for performance and not intended to work >>>> + * with absolute clocksource cycles, as it will easily overflow, >>>> + * but just intended for relative (delta) clocksource cycles. >>>> + * >>>> * XXX - This could use some mult_lxl_ll() asm optimization >>> Just as a heads up, it seems your working against an older kernel, as >>> this didn't apply. Its simple enough to fix up, so I'll do so, but in >>> the future, please submit patches against something close to Linus >>> HEAD. >> >> Oops, sorry; it wasn't version skew (I'm at v4.9-rc4) but whitespace damage. >> I assumed if I just pasted the patch into Thunderbird it would work, since >> it had >> no tabs. But bizarrely, if I look at the patch in the mailer, it shows a >> two-space >> prefix, but when I save the email to a file, it has a three-space prefix. >> WTF? > Yea. Not many mailers can be trusted with sending patches. I'd > recommend git-send-email. :) Yes, indeed. That is what I always do normally. But since I was already part way through an email written interactively in my mailer, I figured I'd be lazy and paste it in. Lesson learned... That said, it is annoyingly difficult to pick up an in-progress email and convert it into something you can edit in, say, Emacs. You end up having to cut and paste the "To:" and "Cc:" and "Subject:" information out of your mailer's helpful GUI and into Emacs in the standard RFC822 way that git send-email understands. I'm not sure there's a better way, but at least I feel better now that I've complained about it :-) -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2016-11-18 16:59 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-11-16 16:57 [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits Chris Metcalf 2016-11-16 18:04 ` John Stultz 2016-11-16 19:30 ` Chris Metcalf 2016-11-16 19:35 ` [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count Chris Metcalf 2016-11-16 19:59 ` John Stultz 2016-11-16 20:16 ` Chris Metcalf 2016-11-16 20:29 ` John Stultz 2016-11-16 20:31 ` John Stultz 2016-11-17 9:53 ` Peter Zijlstra 2016-11-17 20:00 ` Chris Metcalf 2016-11-18 10:34 ` Peter Zijlstra 2016-11-18 14:24 ` Chris Metcalf 2016-11-18 14:50 ` Peter Zijlstra 2016-11-16 19:40 ` [PATCH] clocksource_cyc2ns: avoid overflowing 64 bits John Stultz 2016-11-16 19:45 ` John Stultz 2016-11-16 19:56 ` Chris Metcalf 2016-11-16 20:00 ` John Stultz 2016-11-16 20:30 ` Chris Metcalf
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).