linux-kernel.vger.kernel.org archive mirror
* [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation
@ 2015-05-26 22:50 Thomas Gleixner
  2015-05-26 22:50 ` [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage Thomas Gleixner
                   ` (7 more replies)
  0 siblings, 8 replies; 34+ messages in thread
From: Thomas Gleixner @ 2015-05-26 22:50 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
	Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park,
	Wenbo Wang

I still have a couple of patches against the timer code in my review
list, but the more I look at them the more horrible I find them.

All these patches are related to the NOHZ stuff and try to mitigate
shortcomings with even more bandaids. And of course these bandaids
cost cycles and are a burden for timer heavy use cases like
networking.

Sadly enough the NOHZ crowd is happy to throw more and more crap at
the kernel and none of these people is even thinking about whether
this stuff could be solved in a different way.

After Eric mentioned in one of the recent discussions that the
timer_migration sysctl is not a lot of gain, I tried to mitigate at
least that issue. That caused me to look deeper and I came up with the
following patch series which:

  - Clean up the sloppy catchup timer jiffies stuff

  - Replace the hash bucket lists by hlists, which shrinks the wheel
    footprint by 50%

  - Replace the timer base pointer in struct timer_list by a cpu
    index, which shrinks struct timer_list by 4 - 8 bytes depending on
    alignment and architecture.

  - Cache nohz_active and timer_migration in the per cpu timer bases,
    so we can avoid going through loops and hoops to figure that out.

This series applies against

    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timer/core

With a pretty naive timer test module and a SCHED_OTHER cpu hog on an
isolated core I verified that I did not wreck anything. The
following table shows the resulting CPU time of the hog for the
various scenarios.

	     nohz=on	      	nohz=on			nohz=off
	     timer_migration=on	timer_migration=off	    

Unpatched:   46.57%		46.87%			46.64%

Patched:     47.76%		48.20%			48.73%

I do not trust my numbers, though, so I really hope that Eric or any
other timer wheel power user can throw a bunch of tests at it.


Now some more rant about the whole NOHZ issues. The timer wheel is
fundamentally unfriendly for NOHZ because:

  - it's impossible to reliably keep track of the next expiring timer

  - finding the next expiring timer is in the worst case O(N) and
    involves touching a gazillion cache lines

The other disaster area is the nohz timer migration mechanism. I think
it's fundamentally broken. 

 - We decide at timer enqueue time on which CPU we queue the timer,
   based on cpu idle heuristics which even fail to recognize that a
   cpu is really busy with interrupt and softirq processing (reported
   by Eric).

 - When moving a timer to some "non-idle" CPU we speculate about the
   system situation in the future (at timer expiry time). This kinda
   works for limited scenarios (NOHZ_FULL and constrained power saving
   on mobile devices). But it is pretty much nonsensical for
   everything else. For network heavy workloads it can be even
   outright wrong and dangerous as Eric explained.

So we really need to put a full stop to all this NOHZ tinkering and
think about proper solutions. I'm not going to merge any NOHZ related
features^Whacks before the existing problems are solved. In
hindsight I should have done that earlier, but it's never too late.

So we need to address two issues:

1) Keeping track of the first expiring timer in the wheel.

   Don't even think about rbtrees or such, it just won't work, but I'm
   willing to accept proof to the contrary.

   One of the ideas I had a long time ago is to have a wheel
   implementation which never cascades and therefore provides
   different levels of granularity per wheel level.

   LVL0    1  jiffy granularity
   LVL1    8  jiffies granularity
   LVL2    64 jiffies granularity
   ....

   This requires more levels than the classic timer wheel, so it's not
   feasible from a memory consumption POV.

   But we can have a compromise and put all 'soon' expiring timers
   into these fancy new wheels and stick all 'far out' timers into the
   last level of the wheel and cascade them occasionally.

   That should work because:

     - Most timers are short term expiry (< 1h)
     - Most timers are canceled before expiry

   So with a sensible upper bound on the number of levels we get the
   following properties:

   	- Natural batching (no more slack magic). This might also help
          networking to avoid rearming of timers.

	- 'Far out' timers are queued in the slowest wheel level and
          occasionally cascaded with the granularity of that level

	- Tracking the next expiring timer can be done with a bitmap,
          so 'get next expiring timer' becomes O(1) without touching
          anything else than the bitmap, if we accept that the upper
          limit of what we can retrieve in O(1) is the granularity of
          the last level, i.e. we treat timers which need recascading
          simply as normal inhabitants of the last wheel level.
        - The expiry code (run_timers) gets more complicated as it
          needs to walk through the different levels more often, but
          with the bitmap that shouldn't be too big an issue.

        - Separate storage for non-deferrable and deferrable timers.

    I spent some time coding that up. It barely boots and certainly
    has a few bugs left and right, but I will send it out nevertheless
    as a reply to this mail to get the discussion going.

2) The timer migration problem

   I think we need to address this from a different angle.

   Joonwoo tried to solve the deferrable timer issue for non pinned
   timers by introducing a global timer wheel for those. That sucks
   and obviously adds more overhead to the fast paths. But the idea
   itself that the non idle cpus consume events from the idle ones is
   not the worst. A global wheel might work for the few deferrable
   timers, but it won't work for anything else.

   So instead of deciding at enqueue time, we should come up with
   different wheels for the different types of timers. That way we
   could keep the locality for the networking scenario, and at the
   same time have a way to delegate all non bound timers of a cpu to
   some other cpu at idle time or pull them from the other cpu when
   requested.

   I haven't thought that through fully yet, but it's something we
   should at least investigate thoroughly. I won't do that for the next
   10 days as I'm about to leave for vacation and conferencing.

Thanks,

	tglx
---
 include/linux/hrtimer.h      |    4 
 include/linux/sched.h        |    6 
 include/linux/sched/sysctl.h |   12 -
 include/linux/timer.h        |   56 ++++----
 include/trace/events/timer.h |   13 --
 kernel/rcu/tree_plugin.h     |    2 
 kernel/sched/core.c          |    9 -
 kernel/sysctl.c              |   18 +-
 kernel/time/hrtimer.c        |   38 ++++--
 kernel/time/tick-internal.h  |   14 ++
 kernel/time/tick-sched.c     |   25 ++-
 kernel/time/timer.c          |  271 ++++++++++++++++++++-----------------------
 kernel/time/timer_list.c     |    2 
 kernel/time/timer_stats.c    |   10 -
 14 files changed, 244 insertions(+), 236 deletions(-)

* Re: [patch 2/7] timer: Remove FIFO guarantee
@ 2015-06-05 22:27 George Spelvin
  2015-06-08 17:48 ` Thomas Gleixner
  0 siblings, 1 reply; 34+ messages in thread
From: George Spelvin @ 2015-06-05 22:27 UTC (permalink / raw)
  To: tglx; +Cc: linux, linux-kernel, viresh.kumar

Two thoughts:

1) It's not clear that timer slack has violated the FIFO guarantee.

   Remember, the guarantee only applies when two timers have the same
   (absolute) timeout; i.e. are requested for the same time.
   
   For two entries with the same timeout, applying slack will produce
   the same adjusted timeout.  That's the whole point of slack; to
   force the timeouts to collide more.

   Because the unadjusted timeouts aren't stored, timeout ordering
   can be violated; i.e. timers for time t and t+1, which round to the
   same time after slack is applied, will expire in FIFO order rather
   than timeout order.  This is non-

   But in the case where the FIFO guarantee used to exist, I don't
   think timer slack broke it.


   I'm not disagreeing with the change, but it's not clear to me that
   it's as safe as you think.


2) If you want to do some more in that area, one thing I've been meaning
   to get around to is eliminating the whole round_jiffies() system.

   It does basically the same thing as the slack system, although with
   less flexibility, and it would be wonderful to rip it out of
   the kernel completely.


Additional rambling you should ignore.  It exists because I haven't
figured out why it's impractical yet.

An interesting variant on the slack system would be to apply slack in the
wakeup loop rather than before sleeping.  It might permit more bundling
and thus fewer wakeups.

You have the timers in original order.  But computing the next expiration
is more complex.  Each timer has a minimum and maximum wakeup time, and
they're sorted by minimum time.  You scan the list of pending timers,
accumulating a "batch" which can all be woken up together.

You keep intersecting each new timer's wakeup interval with the pending
batch, until the batch maximum is less than the next timer's minimum.

The final result is a range of permissible timeouts, from which one is
chosen as the processor wakeup time.  I'd be inclined to err on the low
side, but whatever works.

(It's also permissible to limit the size of a batch arbitrarily.
After scanning N timers, just pick a wakeup time no later than
the N+1st timer's minimum time and stop.)

When a new timer is added, there are three cases:
1) (Common) Its minimum timeout is later than the current pending wakeup.
   Add it to the pending queue normally.
2) Its wakeup range includes the pending time.  Add it to the batch.
3) Its maximum wakeup time is less than the pending time.  Reschedule
   the hardware timer.  When it goes off, scan the batch for additional
   timers that can be fired now, before building the new batch.

The idea is that each timer would be scanned twice: once when
building the batch, and a second time when running it.

Except that's not true if the third case happens frequently, and
batches keep getting preempted and re-scanned.

But there's an easy workaround to avoid O(n^2): at the expense of
possibly non-optimal wakeups, if what's left of the old batch is larger
than reasonable to rescan, just remember the old wakeup range (the
minimum will be accurate, because it's the minimum of the last timer
in the batch, which is still pending, but the maximum might be wrong)
and extend the batch without recomputing the maximum.

There's probably some reason this can't be made to work, but I wanted
to do a brain dump.


end of thread, other threads:[~2015-06-27 11:25 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
2015-05-26 22:50 ` [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage Thomas Gleixner
2015-05-27  5:39   ` Viresh Kumar
2015-06-19 13:22   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 2/7] timer: Remove FIFO guarantee Thomas Gleixner
2015-05-27  9:11   ` Viresh Kumar
2015-06-19 13:22   ` [tip:timers/core] timer: Remove FIFO "guarantee" tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 3/7] timer: Use hlist for the timer wheel hash buckets Thomas Gleixner
2015-05-27  9:13   ` Viresh Kumar
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 4/7] timer: Replace timer base by a cpu index Thomas Gleixner
2015-05-27  9:22   ` Viresh Kumar
2015-05-27 12:09     ` Peter Zijlstra
2015-06-02 13:58       ` Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-06-27  9:55   ` [PATCH] timer: Fix unsafe cpu variable access in migrate_timers Jan Kiszka
2015-06-27 11:00     ` Borislav Petkov
2015-06-27 11:19       ` Jan Kiszka
2015-06-27 11:25         ` Borislav Petkov
2015-05-26 22:50 ` [patch 5/7] timer: stats: Simplify the flags handling Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] timer: Stats: " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 6/7] timer: Reduce timer migration overhead if disabled Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 7/7] timer: Minimize nohz off overhead Thomas Gleixner
2015-06-19 13:24   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-27 14:53 ` [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Eric Dumazet
2015-06-02 13:58   ` Thomas Gleixner
2015-06-05 22:27 [patch 2/7] timer: Remove FIFO guarantee George Spelvin
2015-06-08 17:48 ` Thomas Gleixner
2015-06-09  5:39   ` George Spelvin
2015-06-09  8:05     ` Thomas Gleixner
2015-06-09  9:43       ` George Spelvin
2015-06-09 10:24         ` Peter Zijlstra
2015-06-09 11:15           ` George Spelvin
