From: Mel Gorman <mgorman@techsingularity.net>
To: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
Matt Fleming <matt@codeblueprint.co.uk>,
Mike Galbraith <mgalbraith@suse.de>,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4
Date: Wed, 3 Feb 2016 14:56:28 +0000 [thread overview]
Message-ID: <20160203145627.GR8337@techsingularity.net> (raw)
In-Reply-To: <20160203133246.GQ8337@techsingularity.net>
On Wed, Feb 03, 2016 at 01:32:46PM +0000, Mel Gorman wrote:
> > Yes, but the question is, are there true cross-CPU cache-misses? I.e. are there
> > any 'global' (or per node) counters that we keep touching and which keep
> > generating cache-misses?
> >
>
> I haven't specifically identified them as I consider the calculations for
> some of them to be expensive in their own right even without accounting for
> cache misses. Moving to per-cpu counters would not eliminate all cache misses
> as a stat updated on one CPU for a task that is woken on a separate CPU is
> still going to trigger a cache miss. Even if such counters were identified
> and moved to separate cache lines, the calculation overhead would remain.
>
I looked closer with perf stat to see if there was a good case for reducing
cache misses using per-cpu counters.
Workload was hackbench with pipes and twice as many processes as there
are CPUs to generate a reasonable amount of scheduler activity.
Kernel 4.5-rc2 vanilla
Performance counter stats for './hackbench -pipe 96 process 1000' (5 runs):
54355.194747 task-clock (msec) # 35.825 CPUs utilized ( +- 0.72% ) (100.00%)
6,654,707 context-switches # 0.122 M/sec ( +- 1.56% ) (100.00%)
376,624 cpu-migrations # 0.007 M/sec ( +- 3.43% ) (100.00%)
128,533 page-faults # 0.002 M/sec ( +- 1.80% ) (100.00%)
111,173,775,559 cycles # 2.045 GHz ( +- 0.76% ) (52.55%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
87,243,428,243 instructions # 0.78 insns per cycle ( +- 0.38% ) (63.74%)
17,067,078,003 branches # 313.992 M/sec ( +- 0.39% ) (61.79%)
65,864,607 branch-misses # 0.39% of all branches ( +- 2.10% ) (61.51%)
26,873,984,605 L1-dcache-loads # 494.414 M/sec ( +- 0.45% ) (33.08%)
1,531,628,468 L1-dcache-load-misses # 5.70% of all L1-dcache hits ( +- 1.14% ) (31.65%)
410,990,209 LLC-loads # 7.561 M/sec ( +- 1.08% ) (31.38%)
38,279,473 LLC-load-misses # 9.31% of all LL-cache hits ( +- 6.82% ) (42.35%)
1.517251315 seconds time elapsed ( +- 1.55% )
Note that the actual cache miss ratio is quite low and indicates that
there is potentially little to gain from using per-cpu counters.
Kernel 4.5-rc2 plus patch that disables schedstats by default
Performance counter stats for './hackbench -pipe 96 process 1000' (5 runs):
51904.139186 task-clock (msec) # 35.322 CPUs utilized ( +- 2.07% ) (100.00%)
5,958,009 context-switches # 0.115 M/sec ( +- 5.90% ) (100.00%)
327,235 cpu-migrations # 0.006 M/sec ( +- 8.24% ) (100.00%)
130,063 page-faults # 0.003 M/sec ( +- 1.10% ) (100.00%)
104,926,877,727 cycles # 2.022 GHz ( +- 2.12% ) (52.08%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
83,768,167,895 instructions # 0.80 insns per cycle ( +- 1.25% ) (63.49%)
16,379,438,730 branches # 315.571 M/sec ( +- 1.47% ) (61.99%)
59,841,332 branch-misses # 0.37% of all branches ( +- 4.60% ) (61.68%)
25,749,569,276 L1-dcache-loads # 496.099 M/sec ( +- 1.37% ) (34.08%)
1,385,090,233 L1-dcache-load-misses # 5.38% of all L1-dcache hits ( +- 3.40% ) (31.88%)
358,531,172 LLC-loads # 6.908 M/sec ( +- 4.65% ) (31.04%)
33,476,691 LLC-load-misses # 9.34% of all LL-cache hits ( +- 4.95% ) (41.71%)
1.469447783 seconds time elapsed ( +- 2.23% )
Note that there is a reduction in cache misses, but it is not a large
percentage, and the miss ratio drops only slightly compared to having
schedstats enabled.
A perf report shows a drop in cache references in functions like
ttwu_stat and [en|de]queue_entity, but it is a small percentage overall.
The same is true for the cycle count: the overall percentage is small,
but the patch eliminates that overhead.
Based on the low level of cache misses, I see no value to using per-cpu
counters as an alternative.
--
Mel Gorman
SUSE Labs
Thread overview: 7+ messages
2016-02-03 11:07 [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4 Mel Gorman
2016-02-03 11:28 ` Ingo Molnar
2016-02-03 11:39 ` Mel Gorman
2016-02-03 12:49 ` Ingo Molnar
2016-02-03 13:32 ` Mel Gorman
2016-02-03 14:56 ` Mel Gorman [this message]
2016-02-03 11:51 ` Srikar Dronamraju