Message-ID: <1432738411.4060.389.camel@edumazet-glaptop2.roam.corp.google.com>
Subject: Re: [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation
From: Eric Dumazet <eric.dumazet@gmail.com>
To: Thomas Gleixner
Cc: LKML, Ingo Molnar, Peter Zijlstra, Paul McKenney, Frederic Weisbecker,
    Eric Dumazet, Viresh Kumar, John Stultz, Joonwoo Park, Wenbo Wang
Date: Wed, 27 May 2015 07:53:31 -0700
In-Reply-To: <20150526210723.245729529@linutronix.de>
References: <20150526210723.245729529@linutronix.de>
List-ID: linux-kernel@vger.kernel.org

On Tue, 2015-05-26 at 22:50 +0000, Thomas Gleixner wrote:
> I still have a couple of patches against the timer code in my review
> list, but the more I look at them the more horrible I find them.
>
> All these patches are related to the NOHZ stuff and try to mitigate
> shortcomings with even more bandaids. And of course these bandaids
> cost cycles and are a burden for timer heavy use cases like
> networking.
>
> Sadly enough the NOHZ crowd is happy to throw more and more crap at
> the kernel and none of these people is even thinking about whether
> this stuff could be solved in a different way.
>
> After Eric mentioned in one of the recent discussions that the
> timer_migration sysctl is not a lot of gain, I tried to mitigate at
> least that issue.
> That caused me to look deeper and I came up with the
> following patch series, which:
>
>   - Cleans up the sloppy catchup timer jiffies stuff
>
>   - Replaces the hash bucket lists by hlists, which shrinks the wheel
>     footprint by 50%
>
>   - Replaces the timer base pointer in struct timer_list by a cpu
>     index, which shrinks struct timer_list by 4 - 8 bytes depending on
>     alignment and architecture
>
>   - Caches nohz_active and timer_migration in the timer per-cpu bases,
>     so we can avoid going through loops and hoops to figure that out
>
> This series applies against
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timer/core
>
> With a pretty naive timer test module and a SCHED_OTHER cpu hog on an
> isolated core I verified that I did not wreck anything. The
> following table shows the resulting CPU time of the hog for the
> various scenarios:
>
>              nohz=on             nohz=on              nohz=off
>              timer_migration=on  timer_migration=off
>
> Unpatched:   46.57%              46.87%               46.64%
>
> Patched:     47.76%              48.20%               48.73%
>
> Though, I do not trust my numbers, so I really hope that Eric or any
> other power timer wheel user can throw a bunch of tests at it.
>
>
> Now some more rant about the whole NOHZ issues. The timer wheel is
> fundamentally unfriendly for NOHZ because:
>
>   - it's impossible to keep reliable track of the next expiring timer
>
>   - finding the next expiring timer is in the worst case O(N) and
>     involves touching a gazillion cache lines
>
> The other disaster area is the NOHZ timer migration mechanism. I think
> it's fundamentally broken.
>
>   - We decide at timer enqueue time on which CPU we queue the timer,
>     based on cpu idle heuristics which even fail to recognize that a
>     cpu is really busy with interrupt and softirq processing (reported
>     by Eric).
>
>   - When moving a timer to some "non-idle" CPU we speculate about the
>     system situation in the future (at timer expiry time).
>     This kinda works for limited scenarios (NOHZ_FULL and constrained
>     power saving on mobile devices). But it is pretty much nonsensical
>     for everything else. For network heavy workloads it can even be
>     outright wrong and dangerous, as Eric explained.
>
> So we really need to put a full stop to all this NOHZ tinkering and
> think about proper solutions. I'm not going to merge any NOHZ related
> features^Whacks before the existing problems are solved. In
> hindsight I should have done that earlier, but it's never too late.
>
> So we need to address two issues:
>
> 1) Keeping track of the first expiring timer in the wheel.
>
>    Don't even think about rbtrees or such, it just won't work, but I'm
>    willing to accept proof of the contrary.
>
>    One of the ideas I had a long time ago is to have a wheel
>    implementation which never cascades and therefore provides
>    different levels of granularity per wheel level:
>
>      LVL0    1 jiffy granularity
>      LVL1    8 jiffies granularity
>      LVL2   64 jiffies granularity
>      ....
>
>    This requires more levels than the classic timer wheel, so it's not
>    feasible from a memory consumption POV.
>
>    But we can have a compromise and put all 'soon' expiring timers
>    into these fancy new wheels and stick all 'far out' timers into the
>    last level of the wheel and cascade them occasionally.
>
>    That should work because:
>
>      - Most timers have short term expiry (< 1h)
>      - Most timers are canceled before expiry
>
>    So we need a sensible upper bound of levels and get the following
>    properties:
>
>      - Natural batching (no more slack magic). This might also help
>        networking to avoid rearming of timers.
>      - Long out timers are queued in the slowest wheel and
>        occasionally cascaded with the granularity of the slowest
>        wheel
>
>      - Tracking the next expiring timer can be done with a bitmap,
>        so 'get next expiring timer' becomes O(1) without
>        touching anything else than the bitmap, if we accept that
>        the upper limit of what we can retrieve in O(1) is the
>        granularity of the last level, i.e. we treat timers which
>        need recascading simply as normal inhabitants of the last
>        wheel level.
>
>      - The expiry code (run_timers) gets more complicated as it
>        needs to walk through the different levels more often, but
>        with the bitmap that shouldn't be too big an issue.
>
>      - Separate storage for non-deferrable and deferrable timers.
>
>    I spent some time coding that up. It barely boots and certainly
>    has a few bugs left and right, but I will send it out nevertheless
>    as a reply to this mail to get the discussion going.

Thanks a lot Thomas!

I finally got back two hosts with 72 cpus where I can try your patches.

I have one comment about socket TCP timers: they are re-armed very
often, but they are never canceled (unless the socket is dismantled).

I am too lazy to search for the commit adding this stuff (before git),
but I guess it helped RPC sessions a lot: no need to cancel a timer if
we're likely to rearm it in the following 10 usec ;)

$ git grep -n INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:29:#undef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:199:#ifdef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:204:#ifdef INET_CSK_CLEAR_TIMERS
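
[Editor's illustration: a minimal user-space model of the non-cascading
multi-granularity wheel Thomas sketches above. All parameters and names
here (LVL_DEPTH, calc_index, etc.) are assumptions for illustration, not
the eventual kernel code: 4 levels of 64 buckets, level n having 8^n
jiffies granularity, plus the per-level occupancy bitmap that makes
"find the next expiring bucket" a few bit scans instead of an O(N) walk.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wheel geometry: 64 buckets per level, level n covers
 * 8x the range of level n-1 at 8x coarser granularity
 * (LVL0 = 1 jiffy, LVL1 = 8 jiffies, LVL2 = 64 jiffies, ...). */
#define LVL_DEPTH    4
#define LVL_BITS     6                     /* 64 buckets per level */
#define LVL_SIZE     (1UL << LVL_BITS)
#define LVL_SHIFT(n) ((n) * 3)             /* granularity = 8^n jiffies */

/* Map an absolute expiry time to a flat (level * LVL_SIZE + bucket)
 * index.  Timers beyond the reach of the last level simply land in the
 * last level; they would be recascaded occasionally, as described. */
static unsigned int calc_index(unsigned long expires, unsigned long now)
{
    unsigned long delta = expires - now;
    unsigned int lvl;

    for (lvl = 0; lvl < LVL_DEPTH - 1; lvl++)
        if (delta < (LVL_SIZE << LVL_SHIFT(lvl)))
            break;

    return lvl * LVL_SIZE + ((expires >> LVL_SHIFT(lvl)) & (LVL_SIZE - 1));
}

/* One occupancy bit per bucket: enqueue sets a bit, and finding the
 * next pending bucket per level is a single bit scan. */
static uint64_t pending[LVL_DEPTH];

static void enqueue(unsigned long expires, unsigned long now)
{
    unsigned int idx = calc_index(expires, now);

    pending[idx >> LVL_BITS] |= 1ULL << (idx & (LVL_SIZE - 1));
}

/* First pending bucket at the given level, or -1 if the level is empty. */
static int next_bucket(unsigned int lvl)
{
    return pending[lvl] ? __builtin_ctzll(pending[lvl]) : -1;
}
```

A timer expiring 5 jiffies out lands in LVL0 bucket 5 with 1-jiffy
granularity; one expiring 100 jiffies out lands in LVL1 with 8-jiffy
granularity, which is the "natural batching" property: nearby far-out
timers share a bucket and fire together.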