From: Eric Dumazet <eric.dumazet@gmail.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: LKML <linux-kernel@vger.kernel.org>,
Ingo Molnar <mingo@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Paul McKenney <paulmck@linux.vnet.ibm.com>,
Frederic Weisbecker <fweisbec@gmail.com>,
Eric Dumazet <edumazet@google.com>,
Viresh Kumar <viresh.kumar@linaro.org>,
John Stultz <john.stultz@linaro.org>,
Joonwoo Park <joonwoop@codeaurora.org>,
Wenbo Wang <wenbo.wang@memblaze.com>
Subject: Re: [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation
Date: Wed, 27 May 2015 07:53:31 -0700 [thread overview]
Message-ID: <1432738411.4060.389.camel@edumazet-glaptop2.roam.corp.google.com> (raw)
In-Reply-To: <20150526210723.245729529@linutronix.de>
On Tue, 2015-05-26 at 22:50 +0000, Thomas Gleixner wrote:
> I still have a couple of patches against the timer code in my review
> list, but the more I look at them the more horrible I find them.
>
> All these patches are related to the NOHZ stuff and try to mitigate
> shortcomings with even more bandaids. And of course these bandaids
> cost cycles and are a burden for timer heavy use cases like
> networking.
>
> Sadly enough the NOHZ crowd is happy to throw more and more crap at
> the kernel and none of these people is even thinking about whether
> this stuff could be solved in a different way.
>
> After Eric mentioned in one of the recent discussions that the
> timer_migration sysctl is not a lot of gain, I tried to mitigate at
> least that issue. That caused me to look deeper and I came up with the
> following patch series which:
>
> - Clean up the sloppy catchup timer jiffies stuff
>
> - Replace the hash bucket lists by hlists, which shrinks the wheel
> footprint by 50%
>
> - Replaces the timer base pointer in struct timer_list by a cpu
> index, which shrinks struct timer_list by 4 - 8 bytes depending on
> alignement and architecture.
>
> - Caches nohz_active and timer_migration in the timer per cpu bases,
> so we can avoid going through loops and hoops to figure that out.
>
> This series applies against
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timer/core
>
> With a pretty naive timer test module and a sched other cpu hog on an
> isolated core I verified that I did not wreckage anything. The
> following table shows the resulting CPU time of the hog for the
> various scenarios.
>
> nohz=on nohz=on nohz=off
> timer_migration=on timer_migration=off
>
> Unpatched: 46.57% 46.87% 46.64%
>
> Patched: 47.76% 48.20% 48.73%
>
> Though, I do not trust my numbers, so I really hope that Eric or any
> other power timer wheel user can throw a bunch of tests at it.
>
>
> Now some more rant about the whole NOHZ issues. The timer wheel is
> fundamentally unfriendly for NOHZ because:
>
> - it's impossible to keep reliably track of the next expiring timer
>
> - finding the next expiring timer is in the worst case O(N) and
> involves touching a gazillion of cache lines
>
> The other disaster area is the nohz timer migration mechanism. I think
> it's fundamentally broken.
>
> - We decide at timer enqueue time, on which CPU we queue the timer,
> based on cpu idle heuristics which even fails to recognize that a
> cpu is really busy with interrupt and softirq processing (reported
> by Eric).
>
> - When moving a timer to some "non-idle" CPU we speculate about the
> system situation in the future (at timer expiry time). This kinda
> works for limited scenarios (NOHZ_FULL and constraint power saving
> on mobile devices). But it is pretty much nonsensical for
> everything else. For network heavy workloads it can be even
> outright wrong and dangerous as Eric explained.
>
> So we really need to put a full stop at all this NOHZ tinkering and
> think about proper solutions. I'm not going to merge any NOHZ related
> features^Whacks before the existing problems are not solved. In
> hindsight I should have done that earlier, but it's never too late.
>
> So we need to address two issues:
>
> 1) Keeping track of the first expiring timer in the wheel.
>
> Don't even think about rbtrees or such, it just wont work, but I'm
> willing to accept prove of the contrary.
>
> One of the ideas I had a long time ago is to have a wheel
> implementation, which does never cascade and therefor provides
> different levels of granularity per wheel level.
>
> LVL0 1 jiffy granularity
> LVL1 8 jiffies granularity
> LVL1 64 jiffies granularity
> ....
>
> This requires more levels than the classic timer wheel, so its not
> feasible from a memory consumption POV.
>
> But we can have a compromise and put all 'soon' expiring timers
> into these fancy new wheels and stick all 'far out' timers into the
> last level of the wheel and cascade them occasionally.
>
> That should work because:
>
> - Most timers are short term expiry (< 1h)
> - Most timers are canceled before expiry
>
> So we need a sensible upper bound of levels and get the following
> properties:
>
> - Natural batching (no more slack magic). This might also help
> networking to avoid rearming of timers.
>
> - Long out timers are queued in the slowest wheel and
> ocassionally with the granularity of the slowest wheel
> cascaded
>
> - Tracking the next expiring timer can be done with a bitmap,
> so the 'get next expiring timer' becomes O(1) without
> touching anything else than the bitmap, if we accept that
> the upper limit of what we can retrieve O(1) is the
> granularity of the last level, i.e. we treat timers which
> need recascading simple as normal inhabitants of the last
> wheel level.
>
> - The expiry code (run_timers) gets more complicated as it
> needs to walk through the different levels more often, but
> with the bitmap that shouldnt be a too big issue.
>
> - Seperate storage for non-deferrable and deferrable timers.
>
> I spent some time in coding that up. It barely boots and has
> certainly a few bugs left and right, but I will send it out
> nevertheless as a reply to this mail to get the discussion going.
Thanks a lot Thomas !
I finally got back two hosts with 72 cpus where I can try your patches.
I have one comment about socket TCP timers : They are re-armed very
often, but they are never canceled (unless the socket is dismantled)
I am too lazy to search for the commit adding this stuff (before git),
but I guess it helped a lot RPC sessions : No need to cancel a timer if
we're likely to rearm it in the following 10 usec ;)
$ git grep -n INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:29:#undef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:199:#ifdef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:204:#ifdef INET_CSK_CLEAR_TIMERS
next prev parent reply other threads:[~2015-05-27 14:53 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
2015-05-26 22:50 ` [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage Thomas Gleixner
2015-05-27 5:39 ` Viresh Kumar
2015-06-19 13:22 ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 2/7] timer: Remove FIFO guarantee Thomas Gleixner
2015-05-27 9:11 ` Viresh Kumar
2015-06-19 13:22 ` [tip:timers/core] timer: Remove FIFO "guarantee" tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 3/7] timer: Use hlist for the timer wheel hash buckets Thomas Gleixner
2015-05-27 9:13 ` Viresh Kumar
2015-06-19 13:23 ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 4/7] timer: Replace timer base by a cpu index Thomas Gleixner
2015-05-27 9:22 ` Viresh Kumar
2015-05-27 12:09 ` Peter Zijlstra
2015-06-02 13:58 ` Thomas Gleixner
2015-06-19 13:23 ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-06-27 9:55 ` [PATCH] timer: Fix unsafe cpu variable access in migrate_timers Jan Kiszka
2015-06-27 11:00 ` Borislav Petkov
2015-06-27 11:19 ` Jan Kiszka
2015-06-27 11:25 ` Borislav Petkov
2015-05-26 22:50 ` [patch 5/7] timer: stats: Simplify the flags handling Thomas Gleixner
2015-06-19 13:23 ` [tip:timers/core] timer: Stats: " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 6/7] timer: Reduce timer migration overhead if disabled Thomas Gleixner
2015-06-19 13:23 ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 7/7] timer: Minimize nohz off overhead Thomas Gleixner
2015-06-19 13:24 ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-27 14:53 ` Eric Dumazet [this message]
2015-06-02 13:58 ` [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1432738411.4060.389.camel@edumazet-glaptop2.roam.corp.google.com \
--to=eric.dumazet@gmail.com \
--cc=edumazet@google.com \
--cc=fweisbec@gmail.com \
--cc=john.stultz@linaro.org \
--cc=joonwoop@codeaurora.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=paulmck@linux.vnet.ibm.com \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
--cc=viresh.kumar@linaro.org \
--cc=wenbo.wang@memblaze.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).