linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 00/20] timer: Refactor the timer wheel
@ 2016-06-13  8:40 Thomas Gleixner
  2016-06-13  8:40 ` [patch 01/20] timer: Make pinned a timer property Thomas Gleixner
                   ` (21 more replies)
  0 siblings, 22 replies; 55+ messages in thread
From: Thomas Gleixner @ 2016-06-13  8:40 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul E. McKenney, Eric Dumazet,
	Frederic Weisbecker, Chris Mason, Arjan van de Ven, rt

The current timer wheel has some drawbacks:

1) Cascading

   Cascading can be an unbound operation and is completely pointless in most
   cases because the vast majority of the timer wheel timers are canceled or
   rearmed before expiration.

2) No fast lookup of the next expiring timer

   In NOHZ scenarios the first timer soft interrupt after a long NOHZ period
   must fast forward the base time to current jiffies. As we have no way to
   find the next expiring timer fast, the code loops and increments the base
   time by one and checks for expired timers in each step. I've observed loops
   lasting 1 ms!

There are some other issues caused by the above, but they are minor compare to
those.

After a thorough analysis of real world data gathered on laptops,
workstations, webservers and other machines (thanks Chris!) I came to the
conclusion that the current 'classic' timer wheel implementation can be
modified to address the above issues.

The vast majority of timer wheel timers is canceled or rearmed before
expiry. Most of them are timeouts for networking and other I/O tasks. The
nature of timeouts is to catch the exception from normal operation (TCP ack
timed out, disk does not respond, etc.). For these kind of timeouts the
accuracy is not really a concern. In case the timeout fires, performance is
down the drain already.

The few timers which actually expire can be split into two categories:

 1) Short expiry times which expect halfways accurate expiry

 2) Long term expiry times are inaccurate today already due to the batching
    which is done for NOHZ.

So for long term expiry timers we can avoid the cascading property and just
leave them in the less granular outer wheels until expiry or cancelation.

Contrary to the classic wheel the granularities of the next wheel is not the
capacity of the first wheel. The granularities of the wheels are in the
currently chosen setting 8 times the granularity of the previous wheel. So for
HZ=250 we end up with the following granularity levels:

Level Offset  Granularity            Range
  0   0          4 ms                 0 ms -        252 ms
  1  64         32 ms               256 ms -       2044 ms (256ms - ~2s)
  2 128        256 ms              2048 ms -      16380 ms (~2s - ~16s)
  3 192       2048 ms (~2s)       16384 ms -     131068 ms (~16s - ~2m)
  4 256      16384 ms (~16s)     131072 ms -    1048572 ms (~2m - ~17m)
  5 320     131072 ms (~2m)     1048576 ms -    8388604 ms (~17m - ~2h)

That's a worst case inaccuracy of 12.5% for the timers which are queued at the
beginning of a level. 

So the new wheel concept addresses the old issues:

1) Cascading is avoided (except for extreme long time timers)

2) By keeping the timers in the bucket until expiry/cancelation we can track
   the buckets which have timers enqueued in a bucket bitmap and therefor can
   lookup the next expiring timer fast and time bound.

A further benefit of the concept is, that the slack calculation which is done
on every timer start is not longer necessary because the granularity levels
provide natural batching already.

Our extensive testing with various loads did not show any performance
degradation vs. the current wheel implementation.

This series has also preparatory patches for changing the NOHZ timer handling
from the current push to a pull model. Currently we decide at timer enqueue
time on which cpu we queue the timer. This is exceptionally silly because
there is no way to predict at enqueue time which cpu will be idle when the
timer expires. Given the fact that most timers are canceled or rearmed before
expiry this is even more silly. We trade a expensive decision and cross cpu
access for a very doubtful benefit.

The solution to this is to store the migrateable timers in a seperate storage
space on the local cpu and when the cpu goes idle tell the others about its
next expiring timer and let the other cpu pull the timer in case of expiry. We
have a proof of concept implementation for this, but it's not yet ready for
posting.

Thanks,

	tglx
---
 arch/x86/kernel/apic/x2apic_uv_x.c  |    4 
 arch/x86/kernel/cpu/mcheck/mce.c    |    4 
 block/genhd.c                       |    5 
 drivers/cpufreq/powernv-cpufreq.c   |    5 
 drivers/mmc/host/jz4740_mmc.c       |    2 
 drivers/net/ethernet/tile/tilepro.c |    4 
 drivers/power/bq27xxx_battery.c     |    5 
 drivers/tty/metag_da.c              |    4 
 drivers/tty/mips_ejtag_fdc.c        |    4 
 drivers/usb/host/ohci-hcd.c         |    1 
 drivers/usb/host/xhci.c             |    2 
 include/linux/list.h                |   10 
 include/linux/timer.h               |   30 
 include/trace/events/timer.h        |   11 
 kernel/time/tick-internal.h         |    1 
 kernel/time/tick-sched.c            |   46 -
 kernel/time/timer.c                 | 1090 +++++++++++++++++++++---------------
 lib/random32.c                      |    1 
 net/ipv4/inet_connection_sock.c     |    7 
 net/ipv4/inet_timewait_sock.c       |    5 
 20 files changed, 731 insertions(+), 510 deletions(-)

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2016-06-17  4:05 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-13  8:40 [patch 00/20] timer: Refactor the timer wheel Thomas Gleixner
2016-06-13  8:40 ` [patch 01/20] timer: Make pinned a timer property Thomas Gleixner
2016-06-13  8:40 ` [patch 02/20] x86/apic/uv: Initialize timer as pinned Thomas Gleixner
2016-06-13  8:40 ` [patch 03/20] x86/mce: " Thomas Gleixner
2016-06-13  8:40 ` [patch 04/20] cpufreq/powernv: " Thomas Gleixner
2016-06-13 13:18   ` Arjan van de Ven
2016-06-13  8:40 ` [patch 05/20] driver/net/ethernet/tile: " Thomas Gleixner
2016-06-13  8:40 ` [patch 06/20] drivers/tty/metag_da: " Thomas Gleixner
2016-06-13 13:13   ` Arjan van de Ven
2016-06-13  8:40 ` [patch 07/20] drivers/tty/mips_ejtag: " Thomas Gleixner
2016-06-13  8:40 ` [patch 08/20] net/ipv4/inet: Initialize timers " Thomas Gleixner
2016-06-13  8:40 ` [patch 09/20] timer: Remove mod_timer_pinned Thomas Gleixner
2016-06-13  8:40 ` [patch 10/20] timer: Add a cascading tracepoint Thomas Gleixner
2016-06-13  8:40 ` [patch 11/20] hlist: Add hlist_is_last_node() helper Thomas Gleixner
2016-06-13 10:27   ` Paolo Bonzini
2016-06-13  8:40 ` [patch 12/20] timer: Give a few structs and members proper names Thomas Gleixner
2016-06-13  8:41 ` [patch 13/20] timer: Switch to a non cascading wheel Thomas Gleixner
2016-06-13 11:40   ` Peter Zijlstra
2016-06-13 12:30     ` Thomas Gleixner
2016-06-13 12:46       ` Eric Dumazet
2016-06-13 14:30         ` Thomas Gleixner
2016-06-14 10:11       ` Ingo Molnar
2016-06-14 16:28         ` Thomas Gleixner
2016-06-14 17:14           ` Arjan van de Ven
2016-06-14 18:05             ` Thomas Gleixner
2016-06-14 20:34               ` Peter Zijlstra
2016-06-14 20:42               ` Peter Zijlstra
2016-06-14 21:17                 ` Eric Dumazet
2016-06-15 14:53                   ` Thomas Gleixner
2016-06-15 14:55                     ` Arjan van de Ven
2016-06-15 16:43                       ` Thomas Gleixner
2016-06-16 15:43                         ` Thomas Gleixner
2016-06-16 16:02                           ` Paul E. McKenney
2016-06-16 18:14                             ` Peter Zijlstra
2016-06-17  0:40                               ` Paul E. McKenney
2016-06-17  4:04                                 ` Paul E. McKenney
2016-06-16 16:04                           ` Arjan van de Ven
2016-06-16 16:09                             ` Thomas Gleixner
2016-06-15 15:05                     ` Eric Dumazet
2016-06-13 14:36   ` Richard Cochran
2016-06-13 14:39     ` Thomas Gleixner
2016-06-13  8:41 ` [patch 15/20] timer: Move __run_timers() function Thomas Gleixner
2016-06-13  8:41 ` [patch 14/20] timer: Remove slack leftovers Thomas Gleixner
2016-06-13  8:41 ` [patch 16/20] timer: Optimize collect timers for NOHZ Thomas Gleixner
2016-06-13  8:41 ` [patch 17/20] tick/sched: Remove pointless empty function Thomas Gleixner
2016-06-13  8:41 ` [patch 18/20] timer: Forward wheel clock whenever possible Thomas Gleixner
2016-06-13 15:14   ` Richard Cochran
2016-06-13 15:18     ` Thomas Gleixner
2016-06-13  8:41 ` [patch 19/20] timer: Split out index calculation Thomas Gleixner
2016-06-13  8:41 ` [patch 20/20] timer: Optimization for same expiry time in mod_timer() Thomas Gleixner
2016-06-13 14:10 ` [patch 00/20] timer: Refactor the timer wheel Eric Dumazet
2016-06-13 16:15 ` Paul E. McKenney
2016-06-15 15:15   ` Paul E. McKenney
2016-06-15 17:02     ` Thomas Gleixner
2016-06-15 20:26       ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).