linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Eric Dumazet <eric.dumazet@gmail.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Eric Dumazet <edumazet@google.com>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	John Stultz <john.stultz@linaro.org>,
	Joonwoo Park <joonwoop@codeaurora.org>,
	Wenbo Wang <wenbo.wang@memblaze.com>
Subject: Re: [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation
Date: Wed, 27 May 2015 07:53:31 -0700	[thread overview]
Message-ID: <1432738411.4060.389.camel@edumazet-glaptop2.roam.corp.google.com> (raw)
In-Reply-To: <20150526210723.245729529@linutronix.de>

On Tue, 2015-05-26 at 22:50 +0000, Thomas Gleixner wrote:
> I still have a couple of patches against the timer code in my review
> list, but the more I look at them the more horrible I find them.
> 
> All these patches are related to the NOHZ stuff and try to mitigate
> shortcomings with even more bandaids. And of course these bandaids
> cost cycles and are a burden for timer heavy use cases like
> networking.
> 
> Sadly enough the NOHZ crowd is happy to throw more and more crap at
> the kernel and none of these people is even thinking about whether
> this stuff could be solved in a different way.
> 
> After Eric mentioned in one of the recent discussions that the
> timer_migration sysctl is not a lot of gain, I tried to mitigate at
> least that issue. That caused me to look deeper and I came up with the
> following patch series which:
> 
>   - Clean up the sloppy catchup timer jiffies stuff
> 
>   - Replace the hash bucket lists by hlists, which shrinks the wheel
>     footprint by 50%
> 
>   - Replaces the timer base pointer in struct timer_list by a cpu
>     index, which shrinks struct timer_list by 4 - 8 bytes depending on
>     alignement and architecture.
> 
>   - Caches nohz_active and timer_migration in the timer per cpu bases,
>     so we can avoid going through loops and hoops to figure that out.
> 
> This series applies against
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timer/core
> 
> With a pretty naive timer test module and a sched other cpu hog on an
> isolated core I verified that I did not wreckage anything. The
> following table shows the resulting CPU time of the hog for the
> various scenarios.
> 
> 	     nohz=on	      	nohz=on			nohz=off
> 	     timer_migration=on	timer_migration=off	    
> 
> Unpatched:   46.57%		46.87%			46.64%
> 
> Patched:     47.76%		48.20%			48.73%
> 
> Though, I do not trust my numbers, so I really hope that Eric or any
> other power timer wheel user can throw a bunch of tests at it.
> 
> 
> Now some more rant about the whole NOHZ issues. The timer wheel is
> fundamentally unfriendly for NOHZ because:
> 
>   - it's impossible to keep reliably track of the next expiring timer
> 
>   - finding the next expiring timer is in the worst case O(N) and
>     involves touching a gazillion of cache lines
> 
> The other disaster area is the nohz timer migration mechanism. I think
> it's fundamentally broken. 
> 
>  - We decide at timer enqueue time, on which CPU we queue the timer,
>    based on cpu idle heuristics which even fails to recognize that a
>    cpu is really busy with interrupt and softirq processing (reported
>    by Eric).
> 
>  - When moving a timer to some "non-idle" CPU we speculate about the
>    system situation in the future (at timer expiry time). This kinda
>    works for limited scenarios (NOHZ_FULL and constraint power saving
>    on mobile devices). But it is pretty much nonsensical for
>    everything else. For network heavy workloads it can be even
>    outright wrong and dangerous as Eric explained.
> 
> So we really need to put a full stop at all this NOHZ tinkering and
> think about proper solutions. I'm not going to merge any NOHZ related
> features^Whacks before the existing problems are not solved. In
> hindsight I should have done that earlier, but it's never too late.
> 
> So we need to address two issues:
> 
> 1) Keeping track of the first expiring timer in the wheel.
> 
>    Don't even think about rbtrees or such, it just wont work, but I'm
>    willing to accept prove of the contrary.
> 
>    One of the ideas I had a long time ago is to have a wheel
>    implementation, which does never cascade and therefor provides
>    different levels of granularity per wheel level.
> 
>    LVL0	   1  jiffy granularity
>    LVL1	   8  jiffies granularity
>    LVL1	   64 jiffies granularity
>    ....
> 
>    This requires more levels than the classic timer wheel, so its not
>    feasible from a memory consumption POV.
> 
>    But we can have a compromise and put all 'soon' expiring timers
>    into these fancy new wheels and stick all 'far out' timers into the
>    last level of the wheel and cascade them occasionally.
> 
>    That should work because:
> 
>      - Most timers are short term expiry (< 1h)
>      - Most timers are canceled before expiry
> 
>    So we need a sensible upper bound of levels and get the following
>    properties:
> 
>    	- Natural batching (no more slack magic). This might also help
>           networking to avoid rearming of timers.
> 
> 	- Long out timers are queued in the slowest wheel and
>           ocassionally with the granularity of the slowest wheel
>           cascaded
> 
> 	- Tracking the next expiring timer can be done with a bitmap,
>           so the 'get next expiring timer' becomes O(1) without
>           touching anything else than the bitmap, if we accept that
>           the upper limit of what we can retrieve O(1) is the
>           granularity of the last level, i.e. we treat timers which
>           need recascading simple as normal inhabitants of the last
>           wheel level.
> 	  
>         - The expiry code (run_timers) gets more complicated as it
>           needs to walk through the different levels more often, but
>           with the bitmap that shouldnt be a too big issue.
> 
>         - Seperate storage for non-deferrable and deferrable timers.
> 
>     I spent some time in coding that up. It barely boots and has
>     certainly a few bugs left and right, but I will send it out
>     nevertheless as a reply to this mail to get the discussion going.

Thanks a lot Thomas !

I finally got back two hosts with 72 cpus where I can try your patches.

I have one comment about socket TCP timers : They are re-armed very
often, but they are never canceled (unless the socket is dismantled)

I am too lazy to search for the commit adding this stuff (before git),
but I guess it helped a lot RPC sessions : No need to cancel a timer if
we're likely to rearm it in the following 10 usec ;)

$ git grep -n INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:29:#undef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:199:#ifdef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:204:#ifdef INET_CSK_CLEAR_TIMERS



  parent reply	other threads:[~2015-05-27 14:53 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
2015-05-26 22:50 ` [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage Thomas Gleixner
2015-05-27  5:39   ` Viresh Kumar
2015-06-19 13:22   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 2/7] timer: Remove FIFO guarantee Thomas Gleixner
2015-05-27  9:11   ` Viresh Kumar
2015-06-19 13:22   ` [tip:timers/core] timer: Remove FIFO "guarantee" tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 3/7] timer: Use hlist for the timer wheel hash buckets Thomas Gleixner
2015-05-27  9:13   ` Viresh Kumar
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 4/7] timer: Replace timer base by a cpu index Thomas Gleixner
2015-05-27  9:22   ` Viresh Kumar
2015-05-27 12:09     ` Peter Zijlstra
2015-06-02 13:58       ` Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-06-27  9:55   ` [PATCH] timer: Fix unsafe cpu variable access in migrate_timers Jan Kiszka
2015-06-27 11:00     ` Borislav Petkov
2015-06-27 11:19       ` Jan Kiszka
2015-06-27 11:25         ` Borislav Petkov
2015-05-26 22:50 ` [patch 5/7] timer: stats: Simplify the flags handling Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] timer: Stats: " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 6/7] timer: Reduce timer migration overhead if disabled Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 7/7] timer: Minimize nohz off overhead Thomas Gleixner
2015-06-19 13:24   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-27 14:53 ` Eric Dumazet [this message]
2015-06-02 13:58   ` [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1432738411.4060.389.camel@edumazet-glaptop2.roam.corp.google.com \
    --to=eric.dumazet@gmail.com \
    --cc=edumazet@google.com \
    --cc=fweisbec@gmail.com \
    --cc=john.stultz@linaro.org \
    --cc=joonwoop@codeaurora.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=viresh.kumar@linaro.org \
    --cc=wenbo.wang@memblaze.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).