Message-Id: <20150526210723.245729529@linutronix.de>
Date: Tue, 26 May 2015 22:50:22 -0000
From: Thomas Gleixner
To: LKML
Cc: Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
    Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park, Wenbo Wang
Subject: [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation

I still have a couple of patches against the timer code in my review
list, but the more I look at them the more horrible I find them. All
these patches are related to the NOHZ stuff and try to mitigate
shortcomings with even more bandaids. And of course these bandaids
cost cycles and are a burden for timer-heavy use cases like
networking.

Sadly enough the NOHZ crowd is happy to throw more and more crap at
the kernel, and none of these people is even thinking about whether
this stuff could be solved in a different way.

After Eric mentioned in one of the recent discussions that the
timer_migration sysctl is not a lot of gain, I tried to mitigate at
least that issue.
That caused me to look deeper, and I came up with the following patch
series, which:

  - Cleans up the sloppy catchup timer jiffies stuff

  - Replaces the hash bucket lists by hlists, which shrinks the wheel
    footprint by 50%

  - Replaces the timer base pointer in struct timer_list by a cpu
    index, which shrinks struct timer_list by 4 - 8 bytes depending
    on alignment and architecture

  - Caches nohz_active and timer_migration in the per-cpu timer
    bases, so we can avoid going through loops and hoops to figure
    that out

This series applies against

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timer/core

With a pretty naive timer test module and a SCHED_OTHER cpu hog on an
isolated core I verified that I did not wreck anything. The following
table shows the resulting CPU time of the hog for the various
scenarios:

	    nohz=on		nohz=on			nohz=off
	    timer_migration=on	timer_migration=off

Unpatched:  46.57%		46.87%			46.64%
Patched:    47.76%		48.20%			48.73%

Though, I do not trust my numbers, so I really hope that Eric or any
other power timer wheel user can throw a bunch of tests at it.

Now some more rant about the whole NOHZ issues. The timer wheel is
fundamentally unfriendly for NOHZ because:

  - it's impossible to keep reliable track of the next expiring timer

  - finding the next expiring timer is in the worst case O(N) and
    involves touching a gazillion of cache lines

The other disaster area is the NOHZ timer migration mechanism. I
think it's fundamentally broken.

  - We decide at timer enqueue time on which CPU we queue the timer,
    based on cpu idle heuristics which even fail to recognize that a
    cpu is really busy with interrupt and softirq processing
    (reported by Eric).

  - When moving a timer to some "non-idle" CPU we speculate about the
    system situation in the future (at timer expiry time). This kinda
    works for limited scenarios (NOHZ_FULL and constrained power
    saving on mobile devices). But it is pretty much nonsensical for
    everything else.
For network-heavy workloads it can be even outright wrong and
dangerous, as Eric explained.

So we really need to put a full stop to all this NOHZ tinkering and
think about proper solutions. I'm not going to merge any NOHZ related
features^Whacks before the existing problems are solved. In hindsight
I should have done that earlier, but it's never too late.

So we need to address two issues:

1) Keeping track of the first expiring timer in the wheel.

   Don't even think about rbtrees or such, it just won't work, but
   I'm willing to accept proof of the contrary.

   One of the ideas I had a long time ago is to have a wheel
   implementation which never cascades and therefore provides
   different levels of granularity per wheel level:

	LVL0	 1 jiffy   granularity
	LVL1	 8 jiffies granularity
	LVL2	64 jiffies granularity
	...

   This requires more levels than the classic timer wheel, so it's
   not feasible from a memory consumption POV.

   But we can have a compromise and put all 'soon' expiring timers
   into these fancy new wheels and stick all 'far out' timers into
   the last level of the wheel and cascade them occasionally.

   That should work because:

    - Most timers have short-term expiry (< 1h)
    - Most timers are canceled before expiry

   So we need a sensible upper bound of levels and get the following
   properties:

    - Natural batching (no more slack magic). This might also help
      networking to avoid rearming of timers.

    - Far-out timers are queued in the slowest wheel and occasionally
      cascaded with the granularity of the slowest wheel.

    - Tracking the next expiring timer can be done with a bitmap, so
      'get next expiring timer' becomes O(1) without touching
      anything else than the bitmap, if we accept that the upper
      limit of what we can retrieve in O(1) is the granularity of the
      last level, i.e. we treat timers which need recascading simply
      as normal inhabitants of the last wheel level.
    - The expiry code (run_timers) gets more complicated, as it needs
      to walk through the different levels more often, but with the
      bitmap that shouldn't be too big an issue.

    - Separate storage for non-deferrable and deferrable timers.

   I spent some time coding that up. It barely boots and certainly
   has a few bugs left and right, but I will send it out nevertheless
   as a reply to this mail to get the discussion going.

2) The timer migration problem

   I think we need to address this from a different angle.

   Joonwoo tried to solve the deferrable timer issue for non-pinned
   timers by introducing a global timer wheel for those. That sucks
   and obviously adds more overhead to the fast paths. But the idea
   itself, that the non-idle cpus consume events from the idle ones,
   is not the worst.

   A global wheel might work for the few deferrable timers, but it
   won't work for anything else.

   So instead of deciding at enqueue time, we should come up with
   different wheels for the different types of timers. That way we
   could keep the locality for the networking scenario, and at the
   same time have a way to delegate all non-bound timers of a cpu to
   some other cpu at idle time, or pull them from the other cpu when
   requested.

   I haven't thought that through fully yet, but it's something we
   should at least investigate thoroughly. I won't do that for the
   next 10 days as I'm about to leave for vacation and conferencing.

Thanks,

	tglx

---
 include/linux/hrtimer.h      |    4 
 include/linux/sched.h        |    6 
 include/linux/sched/sysctl.h |   12 -
 include/linux/timer.h        |   56 ++++----
 include/trace/events/timer.h |   13 --
 kernel/rcu/tree_plugin.h     |    2 
 kernel/sched/core.c          |    9 -
 kernel/sysctl.c              |   18 +-
 kernel/time/hrtimer.c        |   38 ++++--
 kernel/time/tick-internal.h  |   14 ++
 kernel/time/tick-sched.c     |   25 ++-
 kernel/time/timer.c          |  271 ++++++++++++++++++++-----------------------
 kernel/time/timer_list.c     |    2 
 kernel/time/timer_stats.c    |   10 -
 14 files changed, 244 insertions(+), 236 deletions(-)