linux-kernel.vger.kernel.org archive mirror
* [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation
@ 2015-05-26 22:50 Thomas Gleixner
  2015-05-26 22:50 ` [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage Thomas Gleixner
                   ` (7 more replies)
  0 siblings, 8 replies; 34+ messages in thread
From: Thomas Gleixner @ 2015-05-26 22:50 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
	Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park,
	Wenbo Wang

I still have a couple of patches against the timer code in my review
list, but the more I look at them the more horrible I find them.

All these patches are related to the NOHZ stuff and try to mitigate
shortcomings with even more bandaids. And of course these bandaids
cost cycles and are a burden for timer heavy use cases like
networking.

Sadly enough the NOHZ crowd is happy to throw more and more crap at
the kernel and none of these people is even thinking about whether
this stuff could be solved in a different way.

After Eric mentioned in one of the recent discussions that the
timer_migration sysctl is not a lot of gain, I tried to mitigate at
least that issue. That caused me to look deeper and I came up with the
following patch series which:

  - Clean up the sloppy catchup timer jiffies stuff

  - Replace the hash bucket lists by hlists, which shrinks the wheel
    footprint by 50%

  - Replace the timer base pointer in struct timer_list by a cpu
    index, which shrinks struct timer_list by 4 - 8 bytes depending on
    alignment and architecture.

  - Cache nohz_active and timer_migration in the per cpu timer bases,
    so we can avoid going through loops and hoops to figure that out.

This series applies against

    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timer/core

With a pretty naive timer test module and a SCHED_OTHER cpu hog on an
isolated core I verified that I did not wreck anything. The
following table shows the resulting CPU time of the hog for the
various scenarios.

	     nohz=on	      	nohz=on			nohz=off
	     timer_migration=on	timer_migration=off	    

Unpatched:   46.57%		46.87%			46.64%

Patched:     47.76%		48.20%			48.73%

I do not trust my numbers, though, so I really hope that Eric or any
other timer wheel power user can throw a bunch of tests at it.


Now some more rant about the whole NOHZ issues. The timer wheel is
fundamentally unfriendly for NOHZ because:

  - it's impossible to reliably keep track of the next expiring timer

  - finding the next expiring timer is in the worst case O(N) and
    involves touching a gazillion cache lines

The other disaster area is the nohz timer migration mechanism. I think
it's fundamentally broken. 

 - We decide at timer enqueue time on which CPU we queue the timer,
   based on cpu idle heuristics which even fail to recognize that a
   cpu is really busy with interrupt and softirq processing (reported
   by Eric).

 - When moving a timer to some "non-idle" CPU we speculate about the
   system situation in the future (at timer expiry time). This kinda
   works for limited scenarios (NOHZ_FULL and constrained power saving
   on mobile devices). But it is pretty much nonsensical for
   everything else. For network heavy workloads it can be even
   outright wrong and dangerous as Eric explained.

So we really need to put a full stop to all this NOHZ tinkering and
think about proper solutions. I'm not going to merge any NOHZ related
features^Whacks before the existing problems are solved. In
hindsight I should have done that earlier, but it's never too late.

So we need to address two issues:

1) Keeping track of the first expiring timer in the wheel.

   Don't even think about rbtrees or such, it just won't work, but I'm
   willing to accept proof to the contrary.

   One of the ideas I had a long time ago is to have a wheel
   implementation which never cascades and therefore provides
   different levels of granularity per wheel level.

   LVL0    1  jiffy granularity
   LVL1    8  jiffies granularity
   LVL2    64 jiffies granularity
   ....

   This requires more levels than the classic timer wheel, so it's not
   feasible from a memory consumption POV.

   But we can have a compromise and put all 'soon' expiring timers
   into these fancy new wheels and stick all 'far out' timers into the
   last level of the wheel and cascade them occasionally.

   That should work because:

     - Most timers are short term expiry (< 1h)
     - Most timers are canceled before expiry

   So with a sensible upper bound on the number of levels we get the
   following properties:

   	- Natural batching (no more slack magic). This might also help
          networking to avoid rearming of timers.

	- 'Far out' timers are queued in the slowest wheel level and
          occasionally cascaded with the granularity of that level

	- Tracking the next expiring timer can be done with a bitmap,
          so 'get next expiring timer' becomes O(1) without touching
          anything else than the bitmap, if we accept that the upper
          limit of what we can retrieve in O(1) is the granularity of
          the last level, i.e. we treat timers which need recascading
          simply as normal inhabitants of the last wheel level.
        - The expiry code (run_timers) gets more complicated as it
          needs to walk through the different levels more often, but
          with the bitmap that shouldn't be too big an issue.

        - Separate storage for non-deferrable and deferrable timers.

    I spent some time coding that up. It barely boots and certainly
    has a few bugs left and right, but I will send it out nevertheless
    as a reply to this mail to get the discussion going.

2) The timer migration problem

   I think we need to address this from a different angle.

   Joonwoo tried to solve the deferrable timer issue for non pinned
   timers by introducing a global timer wheel for those. That sucks
   and obviously adds more overhead to the fast paths. But the idea
   itself that the non idle cpus consume events from the idle ones is
   not the worst. A global wheel might work for the few deferrable
   timers, but it won't work for anything else.

   So instead of deciding at enqueue time, we should come up with
   different wheels for the different types of timers. That way we
   could keep the locality for the networking scenario, and at the
   same time have a way to delegate all non bound timers of a cpu to
   some other cpu at idle time or pull them from the other cpu when
   requested.

   I haven't thought that through fully yet, but it's something we
   should at least investigate thoroughly. I won't do that for the next
   10 days as I'm about to leave for vacation and conferencing.

Thanks,

	tglx
---
 include/linux/hrtimer.h      |    4 
 include/linux/sched.h        |    6 
 include/linux/sched/sysctl.h |   12 -
 include/linux/timer.h        |   56 ++++----
 include/trace/events/timer.h |   13 --
 kernel/rcu/tree_plugin.h     |    2 
 kernel/sched/core.c          |    9 -
 kernel/sysctl.c              |   18 +-
 kernel/time/hrtimer.c        |   38 ++++--
 kernel/time/tick-internal.h  |   14 ++
 kernel/time/tick-sched.c     |   25 ++-
 kernel/time/timer.c          |  271 ++++++++++++++++++++-----------------------
 kernel/time/timer_list.c     |    2 
 kernel/time/timer_stats.c    |   10 -
 14 files changed, 244 insertions(+), 236 deletions(-)

* Re: [patch 2/7] timer: Remove FIFO guarantee
@ 2015-06-05 22:27 George Spelvin
  2015-06-08 17:48 ` Thomas Gleixner
  0 siblings, 1 reply; 34+ messages in thread
From: George Spelvin @ 2015-06-05 22:27 UTC (permalink / raw)
  To: tglx; +Cc: linux, linux-kernel, viresh.kumar

Two thoughts:

1) It's not clear that timer slack has violated the FIFO guarantee.

   Remember, the guarantee only applies when two timers have the same
   (absolute) timeout; i.e. are requested for the same time.
   
   For two entries with the same timeout, applying slack will produce
   the same adjusted timeout.  That's the whole point of slack; to
   force the timeouts to collide more.

   Because the unadjusted timeouts aren't stored, timeout ordering
   can be violated; i.e. timers for time t and t+1, which round to the
   same time after slack is applied, will expire in FIFO order rather
   than timeout order.  This is non-

   But in the case where the FIFO guarantee used to exist, I don't
   think timer slack broke it.


   I'm not disagreeing with the change, but it's not clear to me that
   it's as safe as you think.


2) If you want to do some more in that area, one thing I've been meaning
   to get around to is eliminating the whole round_jiffies() system.

   It does basically the same thing as the slack system, although with
   less flexibility, and it would be wonderful to rip it out of
   the kernel completely.


Additional rambling you should ignore.  It exists because I haven't
figured out why it's impractical yet.

An interesting variant on the slack system would be to apply slack in the
wakeup loop rather than before sleeping.  It might permit more bundling
and thus fewer wakeups.

You have the timers in original order.  But computing the next expiration
is more complex.  Each timer has a minimum and maximum wakeup time, and
they're sorted by minimum time.  You scan the list of pending timers,
accumulating a "batch" which can all be woken up together.

You keep intersecting each new timer's wakeup interval with the pending
batch, until the batch maximum is less than the next timer's minimum.

The final result is a range of permissible timeouts, from which one is
chosen as the processor wakeup time.  I'd be inclined to err on the low
side, but whatever works.

(It's also permissible to limit the size of a batch arbitrarily.
After scanning N timers, just pick a wakeup time no later than
the N+1st timer's minimum time and stop.)

When a new timer is added, there are three cases:
1) (Common) Its minimum timeout is later than the current pending wakeup.
   Add it to the pending queue normally.
2) Its wakeup range includes the pending time.  Add it to the batch.
3) Its maximum wakeup time is less than the pending time.  Reschedule
   the hardware timer.  When it goes off, scan the batch for additional
   timers that can be fired now, before building the new batch.

The idea is that each timer would be scanned twice: once when
building the batch, and a second time when running it.

Except that's not true if the third case happens frequently, and
batches keep getting preempted and re-scanned.

But there's an easy workaround to avoid O(n^2): at the expense of
possibly non-optimal wakeups, if what's left of the old batch is larger
than reasonable to rescan, just remember the old wakeup range (the
minimum will be accurate, because it's the minimum of the last timer
in the batch, which is still pending, but the maximum might be wrong)
and extend the batch without recomputing the maximum.

There's probably some reason this can't be made to work, but I wanted
to do a brain dump.


end of thread, other threads:[~2015-06-27 11:25 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
2015-05-26 22:50 ` [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage Thomas Gleixner
2015-05-27  5:39   ` Viresh Kumar
2015-06-19 13:22   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 2/7] timer: Remove FIFO guarantee Thomas Gleixner
2015-05-27  9:11   ` Viresh Kumar
2015-06-19 13:22   ` [tip:timers/core] timer: Remove FIFO "guarantee" tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 3/7] timer: Use hlist for the timer wheel hash buckets Thomas Gleixner
2015-05-27  9:13   ` Viresh Kumar
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 4/7] timer: Replace timer base by a cpu index Thomas Gleixner
2015-05-27  9:22   ` Viresh Kumar
2015-05-27 12:09     ` Peter Zijlstra
2015-06-02 13:58       ` Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-06-27  9:55   ` [PATCH] timer: Fix unsafe cpu variable access in migrate_timers Jan Kiszka
2015-06-27 11:00     ` Borislav Petkov
2015-06-27 11:19       ` Jan Kiszka
2015-06-27 11:25         ` Borislav Petkov
2015-05-26 22:50 ` [patch 5/7] timer: stats: Simplify the flags handling Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] timer: Stats: " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 6/7] timer: Reduce timer migration overhead if disabled Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 7/7] timer: Minimize nohz off overhead Thomas Gleixner
2015-06-19 13:24   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-27 14:53 ` [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Eric Dumazet
2015-06-02 13:58   ` Thomas Gleixner
2015-06-05 22:27 [patch 2/7] timer: Remove FIFO guarantee George Spelvin
2015-06-08 17:48 ` Thomas Gleixner
2015-06-09  5:39   ` George Spelvin
2015-06-09  8:05     ` Thomas Gleixner
2015-06-09  9:43       ` George Spelvin
2015-06-09 10:24         ` Peter Zijlstra
2015-06-09 11:15           ` George Spelvin
