linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] timers: Fix timer inaccuracy
@ 2016-11-09  9:36 Joonwoo Park
  2016-11-09  9:56 ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: Joonwoo Park @ 2016-11-09  9:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Joonwoo Park, John Stultz, Eric Dumazet, Frederic Weisbecker,
	Linus Torvalds, Paul E. McKenney, Peter Zijlstra, linux-kernel

When a new timer list enqueued into the time wheel, array index
for the given expiry time is:

  expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
  idx = LVL_OFFS(lvl) + (expires & LVL_MASK);

The granularity of the expiry time level is being added to the index
in order to fire the timer after its expiry time for the case when
the timer cannot fire at the exact time because of each level's
granularity.  However current index calculation also increases index
of timer list even if the timer can fire at exact time.  Consequently
timers which can fire at exact time including all in the first level
of bucket fire with one jiffy delay at present.

Fix such inaccuracy by adding granularity of expiry time level only
when a given timer cannot fire at exact time.

With CONFIG_HZ_100=y

Before:
  225.768008: timer_start: timer=ffffffffa00042c0 function=timer_func [timer] expires=4294959868 [timeout=2] flags=0x0e800000
  <snip>
  225.797961: timer_expire_entry: timer=ffffffffa00042c0 function=timer_func [timer] now=4294959869

After:
   54.424805: timer_start: timer=ffffffffa00042c0 function=timer_func [timer] expires=4294942730 [timeout=2] flags=0x10400000
   <snip>
   54.444764: timer_expire_entry: timer=ffffffffa00042c0 function=timer_func [timer] now=4294942730

Fixes: 500462a9de65 "timers: Switch to a non-cascading wheel"
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
---
 kernel/time/timer.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index c611c47..f6ad4e9 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -467,17 +467,29 @@ static inline void timer_set_idx(struct timer_list *timer, unsigned int idx)
  */
 static inline unsigned calc_index(unsigned expires, unsigned lvl)
 {
-	expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
+	if (expires & ~(UINT_MAX << LVL_SHIFT(lvl)))
+		expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
+	else
+		expires = expires >> LVL_SHIFT(lvl);
 	return LVL_OFFS(lvl) + (expires & LVL_MASK);
 }
 
+static inline unsigned calc_index_min_granularity(unsigned expires)
+{
+	return LVL_OFFS(0) + ((expires >> LVL_SHIFT(0)) & LVL_MASK);
+}
+
 static int calc_wheel_index(unsigned long expires, unsigned long clk)
 {
 	unsigned long delta = expires - clk;
 	unsigned int idx;
 
 	if (delta < LVL_START(1)) {
-		idx = calc_index(expires, 0);
+		/*
+		 * calc_index(expires, 0) should still work but we can
+		 * optimize as LVL_SHIFT(0) is always 0.
+		 */
+		idx = calc_index_min_granularity(expires);
 	} else if (delta < LVL_START(2)) {
 		idx = calc_index(expires, 1);
 	} else if (delta < LVL_START(3)) {
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] timers: Fix timer inaccuracy
  2016-11-09  9:36 [PATCH] timers: Fix timer inaccuracy Joonwoo Park
@ 2016-11-09  9:56 ` Thomas Gleixner
  2016-11-10  1:32   ` Joonwoo Park
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2016-11-09  9:56 UTC (permalink / raw)
  To: Joonwoo Park
  Cc: John Stultz, Eric Dumazet, Frederic Weisbecker, Linus Torvalds,
	Paul E. McKenney, Peter Zijlstra, linux-kernel

On Wed, 9 Nov 2016, Joonwoo Park wrote:

> When a new timer list enqueued into the time wheel, array index
> for the given expiry time is:
> 
>   expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
>   idx = LVL_OFFS(lvl) + (expires & LVL_MASK);
> 
> The granularity of the expiry time level is being added to the index
> in order to fire the timer after its expiry time for the case when
> the timer cannot fire at the exact time because of each level's
> granularity.  However current index calculation also increases index
> of timer list even if the timer can fire at exact time.  Consequently
> timers which can fire at exact time including all in the first level
> of bucket fire with one jiffy delay at present.
> 
> Fix such inaccuracy by adding granularity of expiry time level only
> when a given timer cannot fire at exact time.

That's simply wrong. We guarantee that the timer sleeps for at least a
jiffy. So in case of the first wheel we _must_ increment by one simply
because the next jiffie might be immanent and not doing so would expire the
timer early.

The wheel is not meant to be accurate at all and I really don't want an
extra conditional in that path just to make it accurate for some expiry
values. That's a completely pointless exercise.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] timers: Fix timer inaccuracy
  2016-11-09  9:56 ` Thomas Gleixner
@ 2016-11-10  1:32   ` Joonwoo Park
  2016-11-10 10:07     ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: Joonwoo Park @ 2016-11-10  1:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, Eric Dumazet, Frederic Weisbecker, Linus Torvalds,
	Paul E. McKenney, Peter Zijlstra, linux-kernel, linux-pm



On 11/09/2016 01:56 AM, Thomas Gleixner wrote:
> On Wed, 9 Nov 2016, Joonwoo Park wrote:
>
>> When a new timer list enqueued into the time wheel, array index
>> for the given expiry time is:
>>
>>   expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
>>   idx = LVL_OFFS(lvl) + (expires & LVL_MASK);
>>
>> The granularity of the expiry time level is being added to the index
>> in order to fire the timer after its expiry time for the case when
>> the timer cannot fire at the exact time because of each level's
>> granularity.  However current index calculation also increases index
>> of timer list even if the timer can fire at exact time.  Consequently
>> timers which can fire at exact time including all in the first level
>> of bucket fire with one jiffy delay at present.
>>
>> Fix such inaccuracy by adding granularity of expiry time level only
>> when a given timer cannot fire at exact time.
>
> That's simply wrong. We guarantee that the timer sleeps for at least a
> jiffy. So in case of the first wheel we _must_ increment by one simply
> because the next jiffie might be immanent and not doing so would expire the
> timer early.
>
> The wheel is not meant to be accurate at all and I really don't want an
> extra conditional in that path just to make it accurate for some expiry
> values. That's a completely pointless exercise.

I understand it's not meant to provide much of accuracy and also don't 
really care about accuracy of sporadic timer aim and fire.
What I'm worried about is case that relies on periodic timer with 
relatively short interval.
If you see bus scaling driver function devfreq_monitor() in 
drivers/devfreq/devfreq.c, it polls hardware every configured interval 
by using deferrable delayed work.  I'm not quite familiar with bus 
scaling so cc'ing linux-pm.

My guess is that ever since we have timer refactoring, above driver is 
polling every configured jiffy + 1 most of time.
When CONFIG_HZ_100=y and the interval is 1, polling will be happening 
every 20ms rather than configured 10ms which is 100% later than ideal.

If that kind of drivers want to run periodic polling at similar level of 
accuracy like pre v4.8, each drivers have to switch to hrtimer but there 
are problems apart from the fact there is no nicely written deferred 
processing mechanism like workqueue with hrtimer -
1) there is no deferrable hrtimer.
2) hrtimer has more overhead more than low res timer, especially hrtimer 
will fire interrupt for individual timer lists which will cause power 
impact.

It also makes sense to me that queued timer especially with long delay 
is tolerable to inaccuracy especially when most of them got canceled 
prior to its expiry time.
But by drivers which use timer as polling mechanism which never cancel 
it, IMHO this behaviour change could be a regression.

Thanks,
Joonwoo

>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] timers: Fix timer inaccuracy
  2016-11-10  1:32   ` Joonwoo Park
@ 2016-11-10 10:07     ` Thomas Gleixner
  2016-11-12  1:55       ` Saravana Kannan
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2016-11-10 10:07 UTC (permalink / raw)
  To: Joonwoo Park
  Cc: John Stultz, Eric Dumazet, Frederic Weisbecker, Linus Torvalds,
	Paul E. McKenney, Peter Zijlstra, linux-kernel, linux-pm

On Wed, 9 Nov 2016, Joonwoo Park wrote:
> On 11/09/2016 01:56 AM, Thomas Gleixner wrote:
> > That's simply wrong. We guarantee that the timer sleeps for at least a
> > jiffy. So in case of the first wheel we _must_ increment by one simply
> > because the next jiffie might be immanent and not doing so would expire the
> > timer early.
> > 
> > The wheel is not meant to be accurate at all and I really don't want an
> > extra conditional in that path just to make it accurate for some expiry
> > values. That's a completely pointless exercise.
> 
> I understand it's not meant to provide much of accuracy and also don't really
> care about accuracy of sporadic timer aim and fire.
> What I'm worried about is case that relies on periodic timer with relatively
> short interval.
> If you see bus scaling driver function devfreq_monitor() in
> drivers/devfreq/devfreq.c, it polls hardware every configured interval by
> using deferrable delayed work.  I'm not quite familiar with bus scaling so
> cc'ing linux-pm.

So you are not familiar with that and therefor use it as an argument for a
potential problem. Pretty scientific approach.

You already mentioned the keyword: DEFERRABLE, which is designed to be
inaccurate.
 
> My guess is that ever since we have timer refactoring, above driver is polling
> every configured jiffy + 1 most of time.

Sure, guess work is the right approach here.

> When CONFIG_HZ_100=y and the interval is 1, polling will be happening every
> 20ms rather than configured 10ms which is 100% later than ideal.

The wheel guarantees that the timer sleeps at least 1 jiffie and this can
only be done by adding one jiffie simply because:

CPU 0	   	CPU 1
		mod_timer(jiffies + 1)
timer irq
jiffies++	queue_timer
		timer_irq
		expire timer

So this timer expired exactly a few micro seconds after arming and therefor
violates the guarantee of firing not before the specified interval.

So depending on when you arm the timer the expiry is going to be:

   1 jiffie <= expiry <= 2 jiffies

If you disable high resolution timers then your user space program using
nanosleep will have exactly the same behaviour.

And this is the only sane behaviour you can expect from timers: To not
expire early.

Every user must cope with late expiry anyway as there is no guarantee that
the task is scheduled right away. Neither is there a guarantee that the
softirq (it might be deferred to ksoftirqd) invokes the callback on time.

The timer wheel is optimized for enqueue/dequeue speed and implicit
batching. Making it artificial accurate for one particular case is just
adding pointless overhead and aside of that it's going to violate the valid
assumption that a 1 jiffie sleep is going to sleep at least 1 jiffie.

> If that kind of drivers want to run periodic polling at similar level of
> accuracy like pre v4.8, each drivers have to switch to hrtimer but there are
> problems apart from the fact there is no nicely written deferred processing
> mechanism like workqueue with hrtimer -
> 1) there is no deferrable hrtimer.
> 2) hrtimer has more overhead more than low res timer, especially hrtimer will
> fire interrupt for individual timer lists which will cause power impact.

Deferrable timers shouldn't have been invented in the first place and yes,
they are not going to happen on hrtimers, quite the contrary, I'm working
on eliminating them completely.
 
> It also makes sense to me that queued timer especially with long delay is
> tolerable to inaccuracy especially when most of them got canceled prior to its
> expiry time.
> But by drivers which use timer as polling mechanism which never cancel it,
> IMHO this behaviour change could be a regression.

When you can come up with a real regression caused by the rework and not
just some handwaving theories then we can revisit that, but until then it's
going to stay as is.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] timers: Fix timer inaccuracy
  2016-11-10 10:07     ` Thomas Gleixner
@ 2016-11-12  1:55       ` Saravana Kannan
  2016-11-12 11:25         ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: Saravana Kannan @ 2016-11-12  1:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Joonwoo Park, John Stultz, Eric Dumazet, Frederic Weisbecker,
	Linus Torvalds, Paul E. McKenney, Peter Zijlstra, linux-kernel,
	linux-pm

On 11/10/2016 02:07 AM, Thomas Gleixner wrote:
> On Wed, 9 Nov 2016, Joonwoo Park wrote:
<SNIP>

> So this timer expired exactly a few micro seconds after arming and therefor
> violates the guarantee of firing not before the specified interval.
>
> So depending on when you arm the timer the expiry is going to be:
>
>     1 jiffie <= expiry <= 2 jiffies
>
> If you disable high resolution timers then your user space program using
> nanosleep will have exactly the same behaviour.
>
> And this is the only sane behaviour you can expect from timers: To not
> expire early.

I completely understand the why you did the +1 jiffy. If you don't know 
at which point between two jiffies the timer was set up, you'll have to 
assume the worst case that the timer was set up just before the jiffy 
rolls over. That why you are doing the +1. Makes sense.

> Every user must cope with late expiry anyway as there is no guarantee that
> the task is scheduled right away. Neither is there a guarantee that the
> softirq (it might be deferred to ksoftirqd) invokes the callback on time.
>
> The timer wheel is optimized for enqueue/dequeue speed and implicit
> batching. Making it artificial accurate for one particular case is just
> adding pointless overhead and aside of that it's going to violate the valid
> assumption that a 1 jiffie sleep is going to sleep at least 1 jiffie.
>
>> If that kind of drivers want to run periodic polling at similar level of
>> accuracy like pre v4.8, each drivers have to switch to hrtimer but there are
>> problems apart from the fact there is no nicely written deferred processing
>> mechanism like workqueue with hrtimer -
>> 1) there is no deferrable hrtimer.
>> 2) hrtimer has more overhead more than low res timer, especially hrtimer will
>> fire interrupt for individual timer lists which will cause power impact.
>
> Deferrable timers shouldn't have been invented in the first place and yes,
> they are not going to happen on hrtimers, quite the contrary, I'm working
> on eliminating them completely.

If you do that, how exactly do you propose drivers do periodic polling 
while the Linux isn't idling, but stop polling when Linux is idle? 
devfreq is a classic example. devfreq is used in a lot of mobile 
devices. Across different vendors, devfreq is used for scaling system 
memory, flash storage, GPU, etc. You are going to kill power if you 
remove deferrable timers without having an alternate mechanism to solve 
this requirement.

For example, when you are browsing on your phone and reading something 
on screen (but not interacting with the device), the 
CPUs/clusters/caches go idle for long periods (several seconds) of time. 
If you remove deferrable timers, you are going to force a CPU wake up 
every 10ms or 50ms or whatever it's configured for.

>> It also makes sense to me that queued timer especially with long delay is
>> tolerable to inaccuracy especially when most of them got canceled prior to its
>> expiry time.
>> But by drivers which use timer as polling mechanism which never cancel it,
>> IMHO this behaviour change could be a regression.
>
> When you can come up with a real regression caused by the rework and not
> just some handwaving theories then we can revisit that, but until then it's
> going to stay as is.

If the polling interval isn't accurate, the regression isn't so much 
about the timer being inaccurate, but more so in the fact that it'll 
take that much longer to react and increase the device frequency. Frame 
rendering time is 16ms. If you react after 20ms instead of 10ms, that's 
way past a frame rendering time. System memory frequency matters a lot 
for frame rendering too.

Thanks,
Saravana

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] timers: Fix timer inaccuracy
  2016-11-12  1:55       ` Saravana Kannan
@ 2016-11-12 11:25         ` Thomas Gleixner
  2016-11-13 23:26           ` skannan
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2016-11-12 11:25 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Joonwoo Park, John Stultz, Eric Dumazet, Frederic Weisbecker,
	Linus Torvalds, Paul E. McKenney, Peter Zijlstra, linux-kernel,
	linux-pm

On Fri, 11 Nov 2016, Saravana Kannan wrote:
> On 11/10/2016 02:07 AM, Thomas Gleixner wrote:
> > Deferrable timers shouldn't have been invented in the first place and yes,
> > they are not going to happen on hrtimers, quite the contrary, I'm working
> > on eliminating them completely.
> 
> If you do that, how exactly do you propose drivers do periodic polling while
> the Linux isn't idling, but stop polling when Linux is idle? devfreq is a
> classic example. devfreq is used in a lot of mobile devices. Across different
> vendors, devfreq is used for scaling system memory, flash storage, GPU, etc.
> You are going to kill power if you remove deferrable timers without having an
> alternate mechanism to solve this requirement.
> 
> For example, when you are browsing on your phone and reading something on
> screen (but not interacting with the device), the CPUs/clusters/caches go idle
> for long periods (several seconds) of time. If you remove deferrable timers,
> you are going to force a CPU wake up every 10ms or 50ms or whatever it's
> configured for.

I know how all that works, but there are other ways to deal with those
timers. We had complaints in the past because those timers can be stuck
forever on a fully idle cpu and that's not a good thing either. The whole
concept is ill defined or not defined at all, lacks any form of sane
semantics and was shoved into the kernel w/i much thought. In hindsight I
should have rejected that deferrable mess 5+ years ago.
 
> > When you can come up with a real regression caused by the rework and not
> > just some handwaving theories then we can revisit that, but until then it's
> > going to stay as is.
> 
> If the polling interval isn't accurate, the regression isn't so much about the
> timer being inaccurate, but more so in the fact that it'll take that much
> longer to react and increase the device frequency. Frame rendering time is
> 16ms. If you react after 20ms instead of 10ms, that's way past a frame
> rendering time. System memory frequency matters a lot for frame rendering too.

If you need precise timers use hrtimers and not a tick based mechanism,
it's that simple.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] timers: Fix timer inaccuracy
  2016-11-12 11:25         ` Thomas Gleixner
@ 2016-11-13 23:26           ` skannan
  0 siblings, 0 replies; 7+ messages in thread
From: skannan @ 2016-11-13 23:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Joonwoo Park, John Stultz, Eric Dumazet, Frederic Weisbecker,
	Linus Torvalds, Paul E. McKenney, Peter Zijlstra, linux-kernel,
	linux-pm

On 2016-11-12 03:25, Thomas Gleixner wrote:
> On Fri, 11 Nov 2016, Saravana Kannan wrote:
>> On 11/10/2016 02:07 AM, Thomas Gleixner wrote:
>> > Deferrable timers shouldn't have been invented in the first place and yes,
>> > they are not going to happen on hrtimers, quite the contrary, I'm working
>> > on eliminating them completely.
>> 
>> If you do that, how exactly do you propose drivers do periodic polling 
>> while
>> the Linux isn't idling, but stop polling when Linux is idle? devfreq 
>> is a
>> classic example. devfreq is used in a lot of mobile devices. Across 
>> different
>> vendors, devfreq is used for scaling system memory, flash storage, 
>> GPU, etc.
>> You are going to kill power if you remove deferrable timers without 
>> having an
>> alternate mechanism to solve this requirement.
>> 
>> For example, when you are browsing on your phone and reading something 
>> on
>> screen (but not interacting with the device), the CPUs/clusters/caches 
>> go idle
>> for long periods (several seconds) of time. If you remove deferrable 
>> timers,
>> you are going to force a CPU wake up every 10ms or 50ms or whatever 
>> it's
>> configured for.
> 
> I know how all that works, but there are other ways to deal with those
> timers.

I won't pretend to be a timer expert. So, could you please let me know 
what you had in mind? The other ideas that have been thrown around are 
things like cancelling the timers in the idle path of the last CPU, etc. 
But that has a lot of problems associated with it and doesn't seem like 
an elegant solution to me.

> We had complaints in the past because those timers can be stuck
> forever on a fully idle cpu and that's not a good thing either.

Agreed.

> The whole
> concept is ill defined or not defined at all, lacks any form of sane
> semantics and was shoved into the kernel w/i much thought. In hindsight 
> I
> should have rejected that deferrable mess 5+ years ago.

At least as an end user the semantics I expected and feel the semantics 
needs to be: The timer needs to fire similar to any other timer, but if 
the all the CPUs running this instance of the Linux kernel are idle, 
then don't fire the timer till one of them wakes up.

>> > When you can come up with a real regression caused by the rework and not
>> > just some handwaving theories then we can revisit that, but until then it's
>> > going to stay as is.
>> 
>> If the polling interval isn't accurate, the regression isn't so much 
>> about the
>> timer being inaccurate, but more so in the fact that it'll take that 
>> much
>> longer to react and increase the device frequency. Frame rendering 
>> time is
>> 16ms. If you react after 20ms instead of 10ms, that's way past a frame
>> rendering time. System memory frequency matters a lot for frame 
>> rendering too.
> 
> If you need precise timers use hrtimers and not a tick based mechanism,
> it's that simple.

Which goes back to the point that hrtimers will always wake up the CPU, 
but that's something we don't want.

To summarize, mobile devices have needs for timers that are generally 
accurate but also don't wake up the CPUs just to run them (something 
with the semantics of a deferrable hrtimer but not necessarily that 
implementation). Unbound deferrable timers did/do that, but had a 
problem of getting stuck on a CPU that's idle for a long time while the 
rest of the CPUs are up. But Joonwoo's fix for the "getting stuck" part 
had valid concerns from TGLX that it could cause cache thrashing. But 
the current fix makes the timers inaccurate (but again, I agree with the 
reasoning for the +1 jiffy). So, it looks likee there's currently no 
solution in the kernel for the above requirements and removing 
deferrable timers would make it even worse for an end mobile product.

Thanks,
Saravana

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-11-13 23:26 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-09  9:36 [PATCH] timers: Fix timer inaccuracy Joonwoo Park
2016-11-09  9:56 ` Thomas Gleixner
2016-11-10  1:32   ` Joonwoo Park
2016-11-10 10:07     ` Thomas Gleixner
2016-11-12  1:55       ` Saravana Kannan
2016-11-12 11:25         ` Thomas Gleixner
2016-11-13 23:26           ` skannan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).