From: Mel Gorman <mgorman@techsingularity.net>
To: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
Matt Fleming <matt@codeblueprint.co.uk>,
Mike Galbraith <mgalbraith@suse.de>,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4
Date: Wed, 3 Feb 2016 14:56:28 +0000 [thread overview]
Message-ID: <20160203145627.GR8337@techsingularity.net> (raw)
In-Reply-To: <20160203133246.GQ8337@techsingularity.net>
On Wed, Feb 03, 2016 at 01:32:46PM +0000, Mel Gorman wrote:
> > Yes, but the question is, are there true cross-CPU cache-misses? I.e. are there
> > any 'global' (or per node) counters that we keep touching and which keep
> > generating cache-misses?
> >
>
> I haven't specifically identified them as I consider the calculations for
> some of them to be expensive in their own right even without accounting for
> cache misses. Moving to per-cpu counters would not eliminate all cache misses
> as a stat updated on one CPU for a task that is woken on a separate CPU is
> still going to trigger a cache miss. Even if such counters were identified
> and moved to separate cache lines, the calculation overhead would remain.
>
I looked closer with perf stat to see if there was a good case for reducing
cache misses using per-cpu counters.
Workload was hackbench with pipes and twice as many processes as there
are CPUs to generate a reasonable amount of scheduler activity.
Kernel 4.5-rc2 vanilla
Performance counter stats for './hackbench -pipe 96 process 1000' (5 runs):
54355.194747 task-clock (msec) # 35.825 CPUs utilized ( +- 0.72% ) (100.00%)
6,654,707 context-switches # 0.122 M/sec ( +- 1.56% ) (100.00%)
376,624 cpu-migrations # 0.007 M/sec ( +- 3.43% ) (100.00%)
128,533 page-faults # 0.002 M/sec ( +- 1.80% ) (100.00%)
111,173,775,559 cycles # 2.045 GHz ( +- 0.76% ) (52.55%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
87,243,428,243 instructions # 0.78 insns per cycle ( +- 0.38% ) (63.74%)
17,067,078,003 branches # 313.992 M/sec ( +- 0.39% ) (61.79%)
65,864,607 branch-misses # 0.39% of all branches ( +- 2.10% ) (61.51%)
26,873,984,605 L1-dcache-loads # 494.414 M/sec ( +- 0.45% ) (33.08%)
1,531,628,468 L1-dcache-load-misses # 5.70% of all L1-dcache hits ( +- 1.14% ) (31.65%)
410,990,209 LLC-loads # 7.561 M/sec ( +- 1.08% ) (31.38%)
38,279,473 LLC-load-misses # 9.31% of all LL-cache hits ( +- 6.82% ) (42.35%)
1.517251315 seconds time elapsed ( +- 1.55% )
Note that the actual cache miss ratio is quite low and indicates that
there is potentially little to gain from using per-cpu counters.
Kernel 4.5-rc2 plus patch that disables schedstats by default
Performance counter stats for './hackbench -pipe 96 process 1000' (5 runs):
51904.139186 task-clock (msec) # 35.322 CPUs utilized ( +- 2.07% ) (100.00%)
5,958,009 context-switches # 0.115 M/sec ( +- 5.90% ) (100.00%)
327,235 cpu-migrations # 0.006 M/sec ( +- 8.24% ) (100.00%)
130,063 page-faults # 0.003 M/sec ( +- 1.10% ) (100.00%)
104,926,877,727 cycles # 2.022 GHz ( +- 2.12% ) (52.08%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
83,768,167,895 instructions # 0.80 insns per cycle ( +- 1.25% ) (63.49%)
16,379,438,730 branches # 315.571 M/sec ( +- 1.47% ) (61.99%)
59,841,332 branch-misses # 0.37% of all branches ( +- 4.60% ) (61.68%)
25,749,569,276 L1-dcache-loads # 496.099 M/sec ( +- 1.37% ) (34.08%)
1,385,090,233 L1-dcache-load-misses # 5.38% of all L1-dcache hits ( +- 3.40% ) (31.88%)
358,531,172 LLC-loads # 6.908 M/sec ( +- 4.65% ) (31.04%)
33,476,691 LLC-load-misses # 9.34% of all LL-cache hits ( +- 4.95% ) (41.71%)
1.469447783 seconds time elapsed ( +- 2.23% )
Note that there is a reduction in cache misses, but it is not a large
percentage, and the miss ratio drops only slightly compared to having
schedstats enabled.
A perf report shows a drop in cache references in functions like
ttwu_stat and [en|de]queue_entity, but it is a small percentage overall.
The same is true for the cycle count: the overall percentage is small,
but the patch eliminates that overhead.
Based on the low level of cache misses, I see no value to using per-cpu
counters as an alternative.
--
Mel Gorman
SUSE Labs
Thread overview: 7+ messages
2016-02-03 11:07 [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4 Mel Gorman
2016-02-03 11:28 ` Ingo Molnar
2016-02-03 11:39 ` Mel Gorman
2016-02-03 12:49 ` Ingo Molnar
2016-02-03 13:32 ` Mel Gorman
2016-02-03 14:56 ` Mel Gorman [this message]
2016-02-03 11:51 ` Srikar Dronamraju