linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [patch 2/7] timer: Remove FIFO guarantee
@ 2015-06-05 22:27 George Spelvin
  2015-06-08 17:48 ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: George Spelvin @ 2015-06-05 22:27 UTC (permalink / raw)
  To: tglx; +Cc: linux, linux-kernel, viresh.kumar

Two thoughts:

1) It's not clear that timer slack has violated the FIFO guarantee.

   Remember, the guarantee only applies when two timers have the same
   (absolute) timeout; i.e. are requested for the same time.
   
   For two entries with the same timeout, applying slack will produce
   the same adjusted timeout.  That's the whole point of slack; to
   force the timeouts to collide more.

   Because the unadjusted timeouts aren't stored, timeout ordering
   can be violated; i.e. timers for time t and t+1, which round to the
   same time ater slack is applied, will expire in FIFO order rather
   than timeout order.  This is non-

   But in the case where he FIFO guatantee used to exist, I don't think
   timer slack broke it.


   I'm not disagreeing with the change, but it's not clear to me that
   it's as safe as you think.


2) If you want to do some more in that area, one thing I've been meaning
   to get around to is eliminating the whole round_jiffies() system.

   It does basically the same thing as the slack system, although with
   less flexibility, and it would be wonderful to rip it out of
   the kernel completely.


Additional rambling you should ignore.  It exists because I haven't
figured out why it's impractical yet.

An interesting variant on the slack system would be to apply slack in the
wakeup loop rather than before sleeping.  It might permit more bundling
and thus fewer wakeups.

You have the timers in original order.  But computing the next expiration
is more complex.  Each timer has a minimum and maximum wakeup time, and
they're sorted by minimum time.  You scan the list of pending timers,
accumulating a "batch" which can all be woken up together.

You keep intersecting each new timer's wakeup interval with the pending
batch, until the batch maximum is less than the next timer's minimum.

The final result is a range of permissible timeouts, from which one is
chosen as the processor wakeup time.  I'd be inclined to err on the low
side, but whatever works.

(It's also permissible to limit the size of a batch arbitrarily.
After scanning N timers, just pick a wakeup time no later than
the N+1st timer's minimum time and stop.)

When a new timer is added, there are three cases:
1) (Common) Its minimum timeout is later than the current pending wakeup.
   Add it to the pending queue normally.
2) Its wakeup range includes the pending time.  Add it to the batch.
3) Its maximum wakeup time is less than the pending time.  Reschedule
   the hardware timer.  When it goes off, scan the batch for additional
   timers that can be fired now, before building the new batch.

The idea is that each timer would be scanned twice: once when
building the batch, and a second time when running it.

Except that's not true if the third case happens frequently, and
batches keep getting preempted and re-scanned.

But there's an easy workaround to avoid O(n^2): at the expense of
possibly non-optimal wakeups, if what's left of the old batch is larger
than reasonable to rescan, just remember the old wakeup range (the
minimum will be accurate, because it's the minimum of the last timer
in the batch, which is still pending, but the maximum might be wrong)
and extend the batch without recomputing the maximum.

There's probably some reason this can't be made to work, but I wanted
to do a brain dump.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 2/7] timer: Remove FIFO guarantee
  2015-06-05 22:27 [patch 2/7] timer: Remove FIFO guarantee George Spelvin
@ 2015-06-08 17:48 ` Thomas Gleixner
  2015-06-09  5:39   ` George Spelvin
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2015-06-08 17:48 UTC (permalink / raw)
  To: George Spelvin; +Cc: LKML, viresh.kumar, Peter Zijlstra

On Sat, 5 Jun 2015, George Spelvin wrote:

> Two thoughts:
> 
> 1) It's not clear that timer slack has violated the FIFO guarantee.
> 
>    Remember, the guarantee only applies when two timers have the same
>    (absolute) timeout; i.e. are requested for the same time.
>    
>    For two entries with the same timeout, applying slack will produce
>    the same adjusted timeout.  That's the whole point of slack; to
>    force the timeouts to collide more.
> 
>    Because the unadjusted timeouts aren't stored, timeout ordering
>    can be violated; i.e. timers for time t and t+1, which round to the
>    same time ater slack is applied, will expire in FIFO order rather
>    than timeout order.  This is non-
> 
>    But in the case where he FIFO guatantee used to exist, I don't think
>    timer slack broke it.

It does. Depending on when you enqueue the timer because the thing is
calculated from the delta (expires - jiffies).

Now thinking more about it. Even before we introduced the slack, the
FIFO guarantee was only there, if two timers were enqueued into the
same array level. Assume the following:

       T1	expires = 257	enqueue time = 0
       T2	expires = 257	enqueue time = 200

So T1 will go into the secondary array and will be recascaded
into the primary array after 256 jiffies.

T2 will go into the primary array BEFORE T1 is recascaded. So T1 will
be queued into the same bucket at cascading AFTER T2.

Another issue is the whole NOHZ target cpu thing. Even if two timers
are enqueued at the same jiffie with the same expiry value, they can
end up on two different cpus. Due to tick skew, which can be explicit,
T2 can expire before T1.

>    I'm not disagreeing with the change, but it's not clear to me that
>    it's as safe as you think.

After thinking more about it, I'm even more sure that any code which
relies on the FIFO "guarantee" is broken today.
 
> 2) If you want to do some more in that area, one thing I've been meaning
>    to get around to is eliminating the whole round_jiffies() system.
> 
>    It does basically the same thing as the slack system, although with
>    less flexibility, and it would be wonderful to rip it out of
>    the kernel completely.

Once we have natural batching in the wheel itself. I think we can kill
that completely.
 
> Additional rambling you should ignore.  It exists because I haven't
> figured out why it's impractical yet.
> 
> An interesting variant on the slack system would be to apply slack in the
> wakeup loop rather than before sleeping.  It might permit more bundling
> and thus fewer wakeups.
> 
> You have the timers in original order.  But computing the next expiration
> is more complex.  Each timer has a minimum and maximum wakeup time, and
> they're sorted by minimum time.  You scan the list of pending timers,
> accumulating a "batch" which can all be woken up together.

Similar to what we do in hrtimers. Though the main issue of the timer
wheel is, that we have no fast way to find the next expiring
timer. Maybe we can revisit this once I got the rework of the wheel
logic completed.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 2/7] timer: Remove FIFO guarantee
  2015-06-08 17:48 ` Thomas Gleixner
@ 2015-06-09  5:39   ` George Spelvin
  2015-06-09  8:05     ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: George Spelvin @ 2015-06-09  5:39 UTC (permalink / raw)
  To: linux, tglx; +Cc: linux-kernel, peterz, viresh.kumar

Thomas Gleixner wrote:
> It does. Depending on when you enqueue the timer because the thing is
> calculated from the delta (expires - jiffies).

Ah, right.  If slack > 0, the slack amount is absolute and the rounding
will be consistent.

But if slack < 0, which is the default, it's a percentage of remaining
jiffies.  Since slack only delays timeouts, an earlier-scheduled
timeout could easily be delayed more.

(There are only six calls to set_timer_slack() to change the default
to something positive in the kernel.)

>>    I'm not disagreeing with the change, but it's not clear to me that
>>    it's as safe as you think.

> After thinking more about it, I'm even more sure that any code which
> relies on the FIFO "guarantee" is broken today.

Indeed, I am completely convinced.  All I might request is a reassignment
of blame in the commit message.



Thank you for your comments on my other blue-sky ideas, too.

I need to look into why we're using wheels, and what the point is.
How much of an advantage do they have over an efficient priority queue
like a pairing heap?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 2/7] timer: Remove FIFO guarantee
  2015-06-09  5:39   ` George Spelvin
@ 2015-06-09  8:05     ` Thomas Gleixner
  2015-06-09  9:43       ` George Spelvin
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2015-06-09  8:05 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-kernel, peterz, viresh.kumar

On Tue, 9 Jun 2015, George Spelvin wrote:
> Thomas Gleixner wrote:
> 
> > After thinking more about it, I'm even more sure that any code which
> > relies on the FIFO "guarantee" is broken today.
> 
> Indeed, I am completely convinced.  All I might request is a reassignment
> of blame in the commit message.

Will do. Thanks for spotting it!
 
> Thank you for your comments on my other blue-sky ideas, too.
> 
> I need to look into why we're using wheels, and what the point is.
> How much of an advantage do they have over an efficient priority queue
> like a pairing heap?

The only reason is performance. The wheel has O(1) insertion and
deletion time while heaps and trees usually have O(log(n)).

Timer wheel timers are usually timeouts and 99% of them are canceled
before expiry. Networking is probably the heaviest use case followed
by disk I/O.

Thanks,

	tglx





^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 2/7] timer: Remove FIFO guarantee
  2015-06-09  8:05     ` Thomas Gleixner
@ 2015-06-09  9:43       ` George Spelvin
  2015-06-09 10:24         ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: George Spelvin @ 2015-06-09  9:43 UTC (permalink / raw)
  To: linux, tglx; +Cc: linux-kernel, peterz, viresh.kumar

Thomas Gleixner wrote:
> The only reason is performance. The wheel has O(1) insertion and
> deletion time while heaps and trees usually have O(log(n)).

Pairing heaps also have O(1) insertion, and rescheduling can
be quite lightweight.

The issue I was worried about is the need for an additional pointer
per timer (left, right, parent) and the associated cache line touch
when modifying the heap.

> Timer wheel timers are usually timeouts and 99% of them are canceled
> before expiry. Networking is probably the heaviest use case followed
> by disk I/O.

And that rewards very lazy structures, that postpone work in the
hope it will become unnecessary.  But a wheel has real problems with
non-tick-based timers, which as you note covers both hrtimers and NOHZ.

Now, two things to note about pairing heaps (and many related heap
structures like Fibonacci heaps, but pairing heaps have a particularly
good constant factor in practice) are:

1) It is O(1) to "meld" two heaps into one.
2) The above is O(1) because it's literally "stick it on a linked list";
   the left child pointers are NULL until the minimum (which is kept
   at the head/root) is deleted and a new minimum has to be found.

Combining these two facts, we could do something wheel-like: divide
the future into a number of heaps, link future events into the correct
subheap, and meld the subheaps into the main heap when necessary.

Hopefully, by the time it's necessary, the subheap would have been
thinned out by timers being ccanceled.

On reflection, it wouldn't even be necessary to have explicit code to
handle the melding.  Just allocate an array of "internal use" nodes
which are easy to find, and place them in the main tree like
normal.  (Each has a timeout which is guaranteed to be earliest in
its subheap, so the subjeap will never need sorting.)

When one gets to the root, the internal node gets recycled (because
we set up the callback function to do that!) and the subheap gets
sorted and merged into the main heap automatically.

Alternatively, the internal use node could be made smaller (e.g.
an hlist head rather than a normal node) at the expense of needing
special-case code to handle it.

Have to think on this.  Heapifying the sublist is O(n) work, which is
the same as overflowing a bucket, but it means that additional deletions
will be more expensive.

Need to think on this.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 2/7] timer: Remove FIFO guarantee
  2015-06-09  9:43       ` George Spelvin
@ 2015-06-09 10:24         ` Peter Zijlstra
  2015-06-09 11:15           ` George Spelvin
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2015-06-09 10:24 UTC (permalink / raw)
  To: George Spelvin; +Cc: tglx, linux-kernel, viresh.kumar

On Tue, Jun 09, 2015 at 05:43:18AM -0400, George Spelvin wrote:
> Thomas Gleixner wrote:
> > The only reason is performance. The wheel has O(1) insertion and
> > deletion time while heaps and trees usually have O(log(n)).
> 
> Pairing heaps also have O(1) insertion, and rescheduling can
> be quite lightweight.
> 
> The issue I was worried about is the need for an additional pointer
> per timer (left, right, parent) and the associated cache line touch
> when modifying the heap.
> 
> > Timer wheel timers are usually timeouts and 99% of them are canceled
> > before expiry. Networking is probably the heaviest use case followed
> > by disk I/O.
> 
> And that rewards very lazy structures, that postpone work in the
> hope it will become unnecessary.  But a wheel has real problems with
> non-tick-based timers, which as you note covers both hrtimers and NOHZ.
> 
> Now, two things to note about pairing heaps (and many related heap
> structures like Fibonacci heaps, but pairing heaps have a particularly
> good constant factor in practice) are:
> 
> 1) It is O(1) to "meld" two heaps into one.
> 2) The above is O(1) because it's literally "stick it on a linked list";
>    the left child pointers are NULL until the minimum (which is kept
>    at the head/root) is deleted and a new minimum has to be found.
> 
> Combining these two facts, we could do something wheel-like: divide
> the future into a number of heaps, link future events into the correct
> subheap, and meld the subheaps into the main heap when necessary.
> 
> Hopefully, by the time it's necessary, the subheap would have been
> thinned out by timers being ccanceled.

Right, so while pairing (and fibonacci) heaps have O(log n) deletion, if
you can delay heapification you can equally delay that extra cost.

But as you say, the heapification is a O(n) hit, which is not too
different from our current 'problem'.

But yes, interesting problem space this.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 2/7] timer: Remove FIFO guarantee
  2015-06-09 10:24         ` Peter Zijlstra
@ 2015-06-09 11:15           ` George Spelvin
  0 siblings, 0 replies; 9+ messages in thread
From: George Spelvin @ 2015-06-09 11:15 UTC (permalink / raw)
  To: linux, peterz; +Cc: linux-kernel, tglx, viresh.kumar

Right, so while pairing (and fibonacci) heaps have O(log n) deletion, if
you can delay heapification you can equally delay that extra cost.

> But as you say, the heapification is a O(n) hit, which is not too
> different from our current 'problem'.

> But yes, interesting problem space this.

It definitely gets my mind going.

The issue that's occupying my thoughts right now is this...

Given a "sublist" for a range of future times, the minimum time
can be either a real timeout or a fake one.

If it's a real one, there's a chance you might delete the minimum timeout
on the list.  you have to handle re-sorting after deletion to maintain
a valid minimum time.

You can have a fake list head, which avoids ever re-sorting, but you're
bascially forcing a "tick": taking a timer interrupt at that time even
though nothing needs it.


One possibility that occurs to me would be to have fake heads on the
sublists, but recognize them when they get selected as the "next timeout".
If that happens, deletemin() them now from the remaining heap and find
next *real* timeout.

Basically, just like the current wheel, you'd have to do some scanning
past possibly-empty sublists, but perhaps by using a much more efficient
heap structure, the sublists could be made larger, so not as many heads
would be needed.


Radix sort isn't really O(n); it's O(n log N) with a small constant,
where N is the range of *possible* values (e.g. N = 2^32, so
log N = 32).  But if you want high-resolution timers, N goes
up a lot.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 2/7] timer: Remove FIFO guarantee
  2015-05-26 22:50 ` [patch 2/7] timer: Remove FIFO guarantee Thomas Gleixner
@ 2015-05-27  9:11   ` Viresh Kumar
  0 siblings, 0 replies; 9+ messages in thread
From: Viresh Kumar @ 2015-05-27  9:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Paul McKenney,
	Frederic Weisbecker, Eric Dumazet, John Stultz, Joonwoo Park,
	Wenbo Wang

On 26-05-15, 22:50, Thomas Gleixner wrote:
> The FIFO guarantee has been violated by the introduction of timer
> slack already. Remove it.
> 
> This is a preparatory patch for converting the timer wheel to hlist
> which reduces the memory foot print of the wheel by 50%. It's a
> seperate patch so any (unlikely to happen) regression caused by this
> can be identified clearly.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/time/timer.c |    6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> Index: tip/kernel/time/timer.c
> ===================================================================
> --- tip.orig/kernel/time/timer.c
> +++ tip/kernel/time/timer.c
> @@ -403,10 +403,8 @@ __internal_add_timer(struct tvec_base *b
>  		i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
>  		vec = base->tv5.vec + i;
>  	}
> -	/*
> -	 * Timers are FIFO:
> -	 */
> -	list_add_tail(&timer->entry, vec);
> +
> +	list_add(&timer->entry, vec);
>  }

Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [patch 2/7] timer: Remove FIFO guarantee
  2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
@ 2015-05-26 22:50 ` Thomas Gleixner
  2015-05-27  9:11   ` Viresh Kumar
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2015-05-26 22:50 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
	Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park,
	Wenbo Wang

[-- Attachment #1: timer-remove-fifo-guarantee.patch --]
[-- Type: text/plain, Size: 968 bytes --]

The FIFO guarantee has been violated by the introduction of timer
slack already. Remove it.

This is a preparatory patch for converting the timer wheel to hlist
which reduces the memory foot print of the wheel by 50%. It's a
seperate patch so any (unlikely to happen) regression caused by this
can be identified clearly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/timer.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

Index: tip/kernel/time/timer.c
===================================================================
--- tip.orig/kernel/time/timer.c
+++ tip/kernel/time/timer.c
@@ -403,10 +403,8 @@ __internal_add_timer(struct tvec_base *b
 		i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
 		vec = base->tv5.vec + i;
 	}
-	/*
-	 * Timers are FIFO:
-	 */
-	list_add_tail(&timer->entry, vec);
+
+	list_add(&timer->entry, vec);
 }
 
 static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-06-09 11:15 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-05 22:27 [patch 2/7] timer: Remove FIFO guarantee George Spelvin
2015-06-08 17:48 ` Thomas Gleixner
2015-06-09  5:39   ` George Spelvin
2015-06-09  8:05     ` Thomas Gleixner
2015-06-09  9:43       ` George Spelvin
2015-06-09 10:24         ` Peter Zijlstra
2015-06-09 11:15           ` George Spelvin
  -- strict thread matches above, loose matches on Subject: below --
2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
2015-05-26 22:50 ` [patch 2/7] timer: Remove FIFO guarantee Thomas Gleixner
2015-05-27  9:11   ` Viresh Kumar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).