Message-ID: <1432738411.4060.389.camel@edumazet-glaptop2.roam.corp.google.com>
Subject: Re: [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation
From: Eric Dumazet <eric.dumazet@gmail.com>
To: Thomas Gleixner
Cc: LKML, Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
    Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park, Wenbo Wang
Date: Wed, 27 May 2015 07:53:31 -0700
In-Reply-To: <20150526210723.245729529@linutronix.de>
References: <20150526210723.245729529@linutronix.de>
List-ID: linux-kernel@vger.kernel.org

On Tue, 2015-05-26 at 22:50 +0000, Thomas Gleixner wrote:
> I still have a couple of patches against the timer code in my review
> list, but the more I look at them the more horrible I find them.
>
> All these patches are related to the NOHZ stuff and try to mitigate
> shortcomings with even more bandaids. And of course these bandaids
> cost cycles and are a burden for timer heavy use cases like
> networking.
>
> Sadly enough the NOHZ crowd is happy to throw more and more crap at
> the kernel and none of these people is even thinking about whether
> this stuff could be solved in a different way.
>
> After Eric mentioned in one of the recent discussions that the
> timer_migration sysctl is not a lot of gain, I tried to mitigate at
> least that issue.
> That caused me to look deeper and I came up with the
> following patch series, which:
>
>   - Cleans up the sloppy catchup timer jiffies stuff
>
>   - Replaces the hash bucket lists by hlists, which shrinks the wheel
>     footprint by 50%
>
>   - Replaces the timer base pointer in struct timer_list by a cpu
>     index, which shrinks struct timer_list by 4 - 8 bytes depending on
>     alignment and architecture
>
>   - Caches nohz_active and timer_migration in the timer per-cpu bases,
>     so we can avoid going through loops and hoops to figure that out
>
> This series applies against
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timer/core
>
> With a pretty naive timer test module and a SCHED_OTHER cpu hog on an
> isolated core I verified that I did not wreck anything. The
> following table shows the resulting CPU time of the hog for the
> various scenarios:
>
>              nohz=on             nohz=on              nohz=off
>              timer_migration=on  timer_migration=off
>
> Unpatched:   46.57%              46.87%               46.64%
>
> Patched:     47.76%              48.20%               48.73%
>
> Though, I do not trust my numbers, so I really hope that Eric or any
> other power timer wheel user can throw a bunch of tests at it.
>
>
> Now some more rant about the whole NOHZ issues. The timer wheel is
> fundamentally unfriendly for NOHZ because:
>
>   - it's impossible to keep reliable track of the next expiring timer
>
>   - finding the next expiring timer is in the worst case O(N) and
>     involves touching a gazillion cache lines
>
> The other disaster area is the NOHZ timer migration mechanism. I think
> it's fundamentally broken.
>
>   - We decide at timer enqueue time on which CPU we queue the timer,
>     based on cpu idle heuristics which even fail to recognize that a
>     cpu is really busy with interrupt and softirq processing (reported
>     by Eric).
>
>   - When moving a timer to some "non-idle" CPU we speculate about the
>     system situation in the future (at timer expiry time).
>     This kinda works for limited scenarios (NOHZ_FULL and constrained
>     power saving on mobile devices). But it is pretty much nonsensical
>     for everything else. For network heavy workloads it can even be
>     outright wrong and dangerous, as Eric explained.
>
> So we really need to put a full stop to all this NOHZ tinkering and
> think about proper solutions. I'm not going to merge any NOHZ related
> features^Whacks before the existing problems are solved. In
> hindsight I should have done that earlier, but it's never too late.
>
> So we need to address two issues:
>
> 1) Keeping track of the first expiring timer in the wheel.
>
>    Don't even think about rbtrees or such, it just won't work, but I'm
>    willing to accept proof of the contrary.
>
>    One of the ideas I had a long time ago is to have a wheel
>    implementation which never cascades and therefore provides
>    different levels of granularity per wheel level:
>
>      LVL0    1 jiffy granularity
>      LVL1    8 jiffies granularity
>      LVL2   64 jiffies granularity
>      ....
>
>    This requires more levels than the classic timer wheel, so it's not
>    feasible from a memory consumption POV.
>
>    But we can have a compromise and put all 'soon' expiring timers
>    into these fancy new wheels and stick all 'far out' timers into the
>    last level of the wheel and cascade them occasionally.
>
>    That should work because:
>
>      - Most timers have short term expiry (< 1h)
>      - Most timers are canceled before expiry
>
>    So we need a sensible upper bound of levels and get the following
>    properties:
>
>      - Natural batching (no more slack magic). This might also help
>        networking to avoid rearming of timers.
>      - Long out timers are queued in the slowest wheel and
>        occasionally cascaded with the granularity of the slowest
>        wheel
>
>      - Tracking the next expiring timer can be done with a bitmap,
>        so 'get next expiring timer' becomes O(1) without
>        touching anything else than the bitmap, if we accept that
>        the upper limit of what we can retrieve in O(1) is the
>        granularity of the last level, i.e. we treat timers which
>        need recascading simply as normal inhabitants of the last
>        wheel level.
>
>      - The expiry code (run_timers) gets more complicated as it
>        needs to walk through the different levels more often, but
>        with the bitmap that shouldn't be too big an issue.
>
>      - Separate storage for non-deferrable and deferrable timers.
>
>    I spent some time coding that up. It barely boots and certainly
>    has a few bugs left and right, but I will send it out nevertheless
>    as a reply to this mail to get the discussion going.

Thanks a lot Thomas!

I finally got back two hosts with 72 cpus where I can try your patches.

I have one comment about socket TCP timers: they are re-armed very
often, but they are never canceled (unless the socket is dismantled).

I am too lazy to search for the commit adding this stuff (before git),
but I guess it helped RPC sessions a lot: no need to cancel a timer if
we're likely to rearm it in the following 10 usec ;)

$ git grep -n INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:29:#undef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:199:#ifdef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:204:#ifdef INET_CSK_CLEAR_TIMERS
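
[Editor's illustration: a minimal user-space model of the non-cascading
multi-granularity wheel Thomas sketches above. All parameters and names
here (LVL_DEPTH, calc_index, etc.) are assumptions for illustration, not
the eventual kernel code: 4 levels of 64 buckets, level n having 8^n
jiffies granularity, plus the per-level occupancy bitmap that makes
"find the next expiring bucket" a few bit scans instead of an O(N) walk.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wheel geometry: 64 buckets per level, level n covers
 * 8x the range of level n-1 at 8x coarser granularity
 * (LVL0 = 1 jiffy, LVL1 = 8 jiffies, LVL2 = 64 jiffies, ...). */
#define LVL_DEPTH    4
#define LVL_BITS     6                     /* 64 buckets per level */
#define LVL_SIZE     (1UL << LVL_BITS)
#define LVL_SHIFT(n) ((n) * 3)             /* granularity = 8^n jiffies */

/* Map an absolute expiry time to a flat (level * LVL_SIZE + bucket)
 * index.  Timers beyond the reach of the last level simply land in the
 * last level; they would be recascaded occasionally, as described. */
static unsigned int calc_index(unsigned long expires, unsigned long now)
{
    unsigned long delta = expires - now;
    unsigned int lvl;

    for (lvl = 0; lvl < LVL_DEPTH - 1; lvl++)
        if (delta < (LVL_SIZE << LVL_SHIFT(lvl)))
            break;

    return lvl * LVL_SIZE + ((expires >> LVL_SHIFT(lvl)) & (LVL_SIZE - 1));
}

/* One occupancy bit per bucket: enqueue sets a bit, and finding the
 * next pending bucket per level is a single bit scan. */
static uint64_t pending[LVL_DEPTH];

static void enqueue(unsigned long expires, unsigned long now)
{
    unsigned int idx = calc_index(expires, now);

    pending[idx >> LVL_BITS] |= 1ULL << (idx & (LVL_SIZE - 1));
}

/* First pending bucket at the given level, or -1 if the level is empty. */
static int next_bucket(unsigned int lvl)
{
    return pending[lvl] ? __builtin_ctzll(pending[lvl]) : -1;
}
```

A timer expiring 5 jiffies out lands in LVL0 bucket 5 with 1-jiffy
granularity; one expiring 100 jiffies out lands in LVL1 with 8-jiffy
granularity, which is the "natural batching" property: nearby far-out
timers share a bucket and fire together.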