linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 00/20] timer: Refactor the timer wheel
@ 2016-06-13  8:40 Thomas Gleixner
  2016-06-13  8:40 ` [patch 01/20] timer: Make pinned a timer property Thomas Gleixner
                   ` (21 more replies)
  0 siblings, 22 replies; 62+ messages in thread
From: Thomas Gleixner @ 2016-06-13  8:40 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul E. McKenney, Eric Dumazet,
	Frederic Weisbecker, Chris Mason, Arjan van de Ven, rt

The current timer wheel has some drawbacks:

1) Cascading

   Cascading can be an unbound operation and is completely pointless in most
   cases because the vast majority of the timer wheel timers are canceled or
   rearmed before expiration.

2) No fast lookup of the next expiring timer

   In NOHZ scenarios the first timer soft interrupt after a long NOHZ period
   must fast forward the base time to current jiffies. As we have no way to
   find the next expiring timer fast, the code loops and increments the base
   time by one and checks for expired timers in each step. I've observed loops
   lasting 1 ms!

There are some other issues caused by the above, but they are minor compare to
those.

After a thorough analysis of real world data gathered on laptops,
workstations, webservers and other machines (thanks Chris!) I came to the
conclusion that the current 'classic' timer wheel implementation can be
modified to address the above issues.

The vast majority of timer wheel timers is canceled or rearmed before
expiry. Most of them are timeouts for networking and other I/O tasks. The
nature of timeouts is to catch the exception from normal operation (TCP ack
timed out, disk does not respond, etc.). For these kind of timeouts the
accuracy is not really a concern. In case the timeout fires, performance is
down the drain already.

The few timers which actually expire can be split into two categories:

 1) Short expiry times which expect halfways accurate expiry

 2) Long term expiry times are inaccurate today already due to the batching
    which is done for NOHZ.

So for long term expiry timers we can avoid the cascading property and just
leave them in the less granular outer wheels until expiry or cancelation.

Contrary to the classic wheel the granularities of the next wheel is not the
capacity of the first wheel. The granularities of the wheels are in the
currently chosen setting 8 times the granularity of the previous wheel. So for
HZ=250 we end up with the following granularity levels:

Level Offset  Granularity            Range
  0   0          4 ms                 0 ms -        252 ms
  1  64         32 ms               256 ms -       2044 ms (256ms - ~2s)
  2 128        256 ms              2048 ms -      16380 ms (~2s - ~16s)
  3 192       2048 ms (~2s)       16384 ms -     131068 ms (~16s - ~2m)
  4 256      16384 ms (~16s)     131072 ms -    1048572 ms (~2m - ~17m)
  5 320     131072 ms (~2m)     1048576 ms -    8388604 ms (~17m - ~2h)

That's a worst case inaccuracy of 12.5% for the timers which are queued at the
beginning of a level. 

So the new wheel concept addresses the old issues:

1) Cascading is avoided (except for extreme long time timers)

2) By keeping the timers in the bucket until expiry/cancelation we can track
   the buckets which have timers enqueued in a bucket bitmap and therefor can
   lookup the next expiring timer fast and time bound.

A further benefit of the concept is, that the slack calculation which is done
on every timer start is not longer necessary because the granularity levels
provide natural batching already.

Our extensive testing with various loads did not show any performance
degradation vs. the current wheel implementation.

This series has also preparatory patches for changing the NOHZ timer handling
from the current push to a pull model. Currently we decide at timer enqueue
time on which cpu we queue the timer. This is exceptionally silly because
there is no way to predict at enqueue time which cpu will be idle when the
timer expires. Given the fact that most timers are canceled or rearmed before
expiry this is even more silly. We trade a expensive decision and cross cpu
access for a very doubtful benefit.

The solution to this is to store the migrateable timers in a seperate storage
space on the local cpu and when the cpu goes idle tell the others about its
next expiring timer and let the other cpu pull the timer in case of expiry. We
have a proof of concept implementation for this, but it's not yet ready for
posting.

Thanks,

	tglx
---
 arch/x86/kernel/apic/x2apic_uv_x.c  |    4 
 arch/x86/kernel/cpu/mcheck/mce.c    |    4 
 block/genhd.c                       |    5 
 drivers/cpufreq/powernv-cpufreq.c   |    5 
 drivers/mmc/host/jz4740_mmc.c       |    2 
 drivers/net/ethernet/tile/tilepro.c |    4 
 drivers/power/bq27xxx_battery.c     |    5 
 drivers/tty/metag_da.c              |    4 
 drivers/tty/mips_ejtag_fdc.c        |    4 
 drivers/usb/host/ohci-hcd.c         |    1 
 drivers/usb/host/xhci.c             |    2 
 include/linux/list.h                |   10 
 include/linux/timer.h               |   30 
 include/trace/events/timer.h        |   11 
 kernel/time/tick-internal.h         |    1 
 kernel/time/tick-sched.c            |   46 -
 kernel/time/timer.c                 | 1090 +++++++++++++++++++++---------------
 lib/random32.c                      |    1 
 net/ipv4/inet_connection_sock.c     |    7 
 net/ipv4/inet_timewait_sock.c       |    5 
 20 files changed, 731 insertions(+), 510 deletions(-)

^ permalink raw reply	[flat|nested] 62+ messages in thread
* Re: [patch 13/20] timer: Switch to a non cascading wheel
@ 2016-06-14  8:16 George Spelvin
  2016-06-14  8:50 ` Thomas Gleixner
  2016-06-14 10:20 ` Peter Zijlstra
  0 siblings, 2 replies; 62+ messages in thread
From: George Spelvin @ 2016-06-14  8:16 UTC (permalink / raw)
  To: tglx; +Cc: edumazet, linux, linux-kernel, peterz, richardcochran

Nice cleanup!


I think I see a buglet in your level-5 cascading.

Suppose a timer is requested far in the future for a time
that is an exact multiple of 32768 jiffies.

collect_expired_timers() scans level 5 after all the previous ones,
and will cascade it to level 0, in a level-0 bucket which has already
been scanned, and won't be scanned again for 64 jiffies.

I agree that 64 jiffies is well within your allowed rounding accuracy,
and order of timer firing is not guaranteed when they're for the same
time, but it is a bit odd when a timer fires 32 jiffies *before* another
timer scheduled for 32 jiffies later.  That's the sort of peculiarity
that could lead to a subtle bug.


While I like the cleanup of just limiting long-term resolution, if
it turns out to be necessary, it's not too hard to add exact timers
back in if a need is found in future.  All you need is a second
__internal_add_timer function that rounds down rather than up, and to
teach expire_timers() to cascade in the unlikely situation that a timer
does have an expiry time in the future.

(It also gets rid of the special case for level 5.)


Other, mostly minor, code comments:

> +/* Level offsets in the wheel */
> +#define LVL0_OFFS	(0)
> +#define LVL1_OFFS	(LVL_SIZE)
> +#define LVL2_OFFS	(LVL1_OFFS + LVL_SIZE)
> +#define LVL3_OFFS	(LVL2_OFFS + LVL_SIZE)
> +#define LVL4_OFFS	(LVL3_OFFS + LVL_SIZE)
> +#define LVL5_OFFS	(LVL4_OFFS + LVL_SIZE)
> +
> +/* Clock divisor for the next level */
> +#define LVL_CLK_SHIFT	3
> +#define LVL_CLK_DIV	(1 << LVL_CLK_SHIFT)
> +#define LVL_CLK_MASK	(LVL_CLK_DIV - 1)
> +
> +/* The shift constants for selecting the bucket at the levels */
> +#define LVL1_SHIFT	(1 * LVL_CLK_SHIFT)
> +#define LVL2_SHIFT	(2 * LVL_CLK_SHIFT)
> +#define LVL3_SHIFT	(3 * LVL_CLK_SHIFT)
> +#define LVL4_SHIFT	(4 * LVL_CLK_SHIFT)
> +#define LVL5_SHIFT	(5 * LVL_CLK_SHIFT)
> +
> +/* The granularity of each level */
> +#define LVL0_GRAN	0x00000001
> +#define LVL1_GRAN	(LVL0_GRAN << LVL_CLK_SHIFT)
> +#define LVL2_GRAN	(LVL1_GRAN << LVL_CLK_SHIFT)
> +#define LVL3_GRAN	(LVL2_GRAN << LVL_CLK_SHIFT)
> +#define LVL4_GRAN	(LVL3_GRAN << LVL_CLK_SHIFT)
> +#define LVL5_GRAN	(LVL4_GRAN << LVL_CLK_SHIFT)

Wouldn't this all be so much simpler as

#define LVL_BITS	6	/* Renamed previous LVL_SHIFT */
#define LVL_SIZE	(1 << LVL_BITS)
#define LVL_MASK	(LVL_BITS - 1)
#define LVL_OFFS(n)	((n) * LVL_SIZE)
#define LVL_SHIFT(n)	((n) * LVL_CLK_SHIFT)
#define LVL_GRAN(n)	(1 << LVL_SHIFT(n))

Then you could do
+static inline unsigned calc_index(unsigned expires, unsigned level),
+{
+	/* Round up to next bin bin */
+	expires = ((expires - 1) >> LVL_SHIFT(level)) + 1;
+	return LVL_OFFS(level) + (expires & LVL_MASK);
+}


> +#define LVL1_TSTART	(LVL_SIZE - 1)

Er... isn't that LVL_SIZE, as documented in the table above?
Then it could be
#define LVL_TSTART(n) (LVL_SIZE << LVL_SHIFT(n))

Ideally, you'd like all of that

+	if (delta < LVL1_TSTART) {
+		idx = (expires + LVL0_GRAN) & LVL_MASK;
+	} else if (delta < LVL2_TSTART) {
+		idx = calc_index(expires, LVL1_GRAN, LVL1_SHIFT, LVL1_OFFS);
+	} else if (delta < LVL3_TSTART) {
+		idx = calc_index(expires, LVL2_GRAN, LVL2_SHIFT, LVL2_OFFS);
+	} else if (delta < LVL4_TSTART) {
+		idx = calc_index(expires, LVL3_GRAN, LVL3_SHIFT, LVL3_OFFS);
+	} else if (delta < LVL5_TSTART) {
+		idx = calc_index(expires, LVL4_GRAN, LVL4_SHIFT, LVL4_OFFS);

to be replaced with __builtin_clz or similar:

	level = __fls(delta | LVL_MASK);
	if (level <  LVL_BITS + LVL_SHIFT(LVL_DEPTH-1)) {	/* or LVL_DEPTH-2, no difference */
		level = (level + LVL_CLK_SHIFT - LVL_BITS) / LVL_CLK_SHIFT;
	} else if ((long)delta < 0) {
		expires = base->clk;
		level = 0;
	} else {
		level = LVL_DEPTH - 1;
	}
	index = calc_index(expires, level);


> +static inline void detach_expired_timer(struct timer_list *timer)
>  {
>  	detach_timer(timer, true);
> -	if (!(timer->flags & TIMER_DEFERRABLE))
> -		base->active_timers--;
> -	base->all_timers--;
>  }

Is there even a reason to have this wrapper any more?  Why not
just replace all calls to it in the source?


> +		timer = hlist_entry(head->first, struct timer_list, entry);
> +		fn = timer->function;
> +		data = timer->data;
> +
> +		timer_stats_account_timer(timer);
> +
> +		base->running_timer = timer;
> +		detach_expired_timer(timer);

Is there some non-obvious reason that you have to fetch fn and data
so early?  It seems like a register pressure pessimization, if the
compiler can't figure out that timer_stats code can't change them.

The cache line containing this timer was already prefetched when you
updated its entry.pprev as part of removing the previous entry from
the list.

I see why you want to fetch them with the lock held in case there's some
freaky race, but I'd do it all after detach_timer().

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2016-06-17  4:05 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-13  8:40 [patch 00/20] timer: Refactor the timer wheel Thomas Gleixner
2016-06-13  8:40 ` [patch 01/20] timer: Make pinned a timer property Thomas Gleixner
2016-06-13  8:40 ` [patch 02/20] x86/apic/uv: Initialize timer as pinned Thomas Gleixner
2016-06-13  8:40 ` [patch 03/20] x86/mce: " Thomas Gleixner
2016-06-13  8:40 ` [patch 04/20] cpufreq/powernv: " Thomas Gleixner
2016-06-13 13:18   ` Arjan van de Ven
2016-06-13  8:40 ` [patch 05/20] driver/net/ethernet/tile: " Thomas Gleixner
2016-06-13  8:40 ` [patch 06/20] drivers/tty/metag_da: " Thomas Gleixner
2016-06-13 13:13   ` Arjan van de Ven
2016-06-13  8:40 ` [patch 07/20] drivers/tty/mips_ejtag: " Thomas Gleixner
2016-06-13  8:40 ` [patch 08/20] net/ipv4/inet: Initialize timers " Thomas Gleixner
2016-06-13  8:40 ` [patch 09/20] timer: Remove mod_timer_pinned Thomas Gleixner
2016-06-13  8:40 ` [patch 10/20] timer: Add a cascading tracepoint Thomas Gleixner
2016-06-13  8:40 ` [patch 11/20] hlist: Add hlist_is_last_node() helper Thomas Gleixner
2016-06-13 10:27   ` Paolo Bonzini
2016-06-13  8:40 ` [patch 12/20] timer: Give a few structs and members proper names Thomas Gleixner
2016-06-13  8:41 ` [patch 13/20] timer: Switch to a non cascading wheel Thomas Gleixner
2016-06-13 11:40   ` Peter Zijlstra
2016-06-13 12:30     ` Thomas Gleixner
2016-06-13 12:46       ` Eric Dumazet
2016-06-13 14:30         ` Thomas Gleixner
2016-06-14 10:11       ` Ingo Molnar
2016-06-14 16:28         ` Thomas Gleixner
2016-06-14 17:14           ` Arjan van de Ven
2016-06-14 18:05             ` Thomas Gleixner
2016-06-14 20:34               ` Peter Zijlstra
2016-06-14 20:42               ` Peter Zijlstra
2016-06-14 21:17                 ` Eric Dumazet
2016-06-15 14:53                   ` Thomas Gleixner
2016-06-15 14:55                     ` Arjan van de Ven
2016-06-15 16:43                       ` Thomas Gleixner
2016-06-16 15:43                         ` Thomas Gleixner
2016-06-16 16:02                           ` Paul E. McKenney
2016-06-16 18:14                             ` Peter Zijlstra
2016-06-17  0:40                               ` Paul E. McKenney
2016-06-17  4:04                                 ` Paul E. McKenney
2016-06-16 16:04                           ` Arjan van de Ven
2016-06-16 16:09                             ` Thomas Gleixner
2016-06-15 15:05                     ` Eric Dumazet
2016-06-13 14:36   ` Richard Cochran
2016-06-13 14:39     ` Thomas Gleixner
2016-06-13  8:41 ` [patch 15/20] timer: Move __run_timers() function Thomas Gleixner
2016-06-13  8:41 ` [patch 14/20] timer: Remove slack leftovers Thomas Gleixner
2016-06-13  8:41 ` [patch 16/20] timer: Optimize collect timers for NOHZ Thomas Gleixner
2016-06-13  8:41 ` [patch 17/20] tick/sched: Remove pointless empty function Thomas Gleixner
2016-06-13  8:41 ` [patch 18/20] timer: Forward wheel clock whenever possible Thomas Gleixner
2016-06-13 15:14   ` Richard Cochran
2016-06-13 15:18     ` Thomas Gleixner
2016-06-13  8:41 ` [patch 19/20] timer: Split out index calculation Thomas Gleixner
2016-06-13  8:41 ` [patch 20/20] timer: Optimization for same expiry time in mod_timer() Thomas Gleixner
2016-06-13 14:10 ` [patch 00/20] timer: Refactor the timer wheel Eric Dumazet
2016-06-13 16:15 ` Paul E. McKenney
2016-06-15 15:15   ` Paul E. McKenney
2016-06-15 17:02     ` Thomas Gleixner
2016-06-15 20:26       ` Paul E. McKenney
2016-06-14  8:16 [patch 13/20] timer: Switch to a non cascading wheel George Spelvin
2016-06-14  8:50 ` Thomas Gleixner
2016-06-14 10:15   ` George Spelvin
2016-06-14 10:20 ` Peter Zijlstra
2016-06-14 12:58   ` George Spelvin
2016-06-14 16:48     ` Thomas Gleixner
2016-06-14 19:56       ` George Spelvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).