Linux-PM Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v4 0/6] Add support for frequency invariance for (some) x86
@ 2019-11-13 12:46 Giovanni Gherdovich
  2019-11-13 12:46 ` [PATCH v4 1/6] x86,sched: Add support for frequency invariance Giovanni Gherdovich
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Giovanni Gherdovich @ 2019-11-13 12:46 UTC (permalink / raw)
  To: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann, Doug Smythies,
	Giovanni Gherdovich

v3 at https://lore.kernel.org/lkml/20191111180549.12166-1-ggherdovich@suse.cz/

Changes wrt v3:

- Add definition of function set_arch_max_freq if !CONFIG_SMP

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cover Letter from v3:

v2 at https://lore.kernel.org/lkml/20191002122926.385-1-ggherdovich@suse.cz/

Changes wrt v2:

- Removing the tick_disable mechanism. Frequency scale-invariance isn't
  just about helping schedutil choose better frequencies, but also
  providing the scheduler load balancer with better metrics. All users of
  PELT signals benefit from this feature. The tick_disable patch disabled
  frequency invariant calculation when a specific driver is in use
  (intel_pstate in active mode).

- static_branch_enable(&arch_scale_freq_key) is now called earlier, right
  after we learn that X86_FEATURE_APERFMPERF is available. Previously Peter
  Z. commented "if we can't tell the max_freq we don't want to use the
  invariant stuff.". I've decided to do it differently: if we can't tell
  the max_freq, then it's because the CPU encodes max_freq in MSRs in a way
  this patch doesn't understand, and we assume max_p is the max_freq which
  seems like a safe bet. As a reminder, max_freq=max_p is encoded by
  setting arch_max_freq=1024 as default value. I'm open to feedback.

- Refactoring the switch case statement in set_cpu_max_freq() as Rafael
  W. Now the first patch doesn't hint at what the following patch will
  bring along.

- Handling the case were turbo is disabled at runtime and a _PPC ACPI
  notification is issued, as requested by Rafael W. This happens eg. when
  some laptop model is disconnected from AC. (Patch #6)

- Handling all Intel x86_64 micro-arches.

- A note for Srinivas P., who expressed concern for Atoms: on Atom CPUs the
  max_freq is set to the highest turbo level, as a power-efficiency
  oriented measure. In this way the ratio curr_freq/max_freq tends to be
  lower, PELT signals are consequently lower, and schedutil doesn't push
  too hard on speed. (Patches #4 and #5).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cover Letter from v2:

v1 at https://lore.kernel.org/lkml/20190909024216.5942-1-ggherdovich@suse.cz/

Changes wrt v1:

- add x86-specific implementation of arch_scale_freq_invariant() using a
  static key that checks for the availability of APERF and MPERF
- refer to GOLDMONT_D instead of GOLDMONT_X, according to recent rename
- set arch_cpu_freq to 1024 from x86_arch_scale_freq_tick_disable() to prevent
  PELT from being fed stale data

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cover Letter from v1:

This is a resend with of Peter Zijlstra's patch to support frequency
scale-invariance on x86 from May 2018 [see 1]. I've added some modifications
and included performance test results. If Peter doesn't mind, I'm slapping my
name on it :)

The changes from Peter's original implementation are:

1) normalizing against the 4-cores turbo level instead or 1-core turbo
2) removing the run-time search for when the above value isn't found in the
   various Intel MSRs -- the base frequency value is taken in that case.

The section "4. KNOWN LIMITATIONS" in the first patch commit message addresses
the reason why this approach was dropped back in 2018, and explains that the
performance gains outweight that issue.

The second patch from Srinivas is taken verbatim from the May 2018 submission
as it still applies.

I apologies for the length of patch #1 commit message; I've made a table of
contents with summaries of each section that should make easier to skim
through the content.

This submission incorporates the feedback and requests for additional tests
received during the presentation made at OSPM 2019 in Pisa three months ago.

[1] https://lore.kernel.org/lkml/20180516044911.28797-2-srinivas.pandruvada@linux.intel.com/

Giovanni Gherdovich (6):
  x86,sched: Add support for frequency invariance
  x86,sched: Add support for frequency invariance on SKYLAKE_X
  x86,sched: Add support for frequency invariance on XEON_PHI_KNL/KNM
  x86,sched: Add support for frequency invariance on ATOM_GOLDMONT*
  x86,sched: Add support for frequency invariance on ATOM
  x86: intel_pstate: handle runtime turbo disablement/enablement in
    freq. invariance

 arch/x86/include/asm/topology.h |  28 ++++
 arch/x86/kernel/smpboot.c       | 324 +++++++++++++++++++++++++++++++++++++++-
 drivers/cpufreq/intel_pstate.c  |   1 +
 kernel/sched/core.c             |   1 +
 kernel/sched/sched.h            |   7 +
 5 files changed, 360 insertions(+), 1 deletion(-)

-- 
2.16.4


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 1/6] x86,sched: Add support for frequency invariance
  2019-11-13 12:46 [PATCH v4 0/6] Add support for frequency invariance for (some) x86 Giovanni Gherdovich
@ 2019-11-13 12:46 ` Giovanni Gherdovich
  2019-11-24  7:49   ` Doug Smythies
  2019-12-02 16:34   ` Ionela Voinescu
  2019-11-13 12:46 ` [PATCH v4 2/6] x86,sched: Add support for frequency invariance on SKYLAKE_X Giovanni Gherdovich
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 19+ messages in thread
From: Giovanni Gherdovich @ 2019-11-13 12:46 UTC (permalink / raw)
  To: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann, Doug Smythies,
	Giovanni Gherdovich

Implement arch_scale_freq_capacity() for 'modern' x86. This function
is used by the scheduler to correctly account usage in the face of
DVFS.

The present patch addresses Intel processors specifically and has positive
performance and performance-per-watt implications for the schedutil cpufreq
governor, bringing it closer to, if not on-par with, the powersave governor
from the intel_pstate driver/framework.

Large performance gains are obtained when the machine is lightly loaded and no
regression are observed at saturation. The benchmarks with the largest gains
are kernel compilation, tbench (the networking version of dbench) and
shell-intensive workloads.

1. FREQUENCY INVARIANCE: MOTIVATION
   * Without it, a task looks larger if the CPU runs slower

2. PECULIARITIES OF X86
   * freq invariance accounting requires knowing the ratio freq_curr/freq_max
   2.1 CURRENT FREQUENCY
       * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz")
   2.2 MAX FREQUENCY
       * It varies with time (turbo). As an approximation, we set it to a
         constant, i.e. 4-cores turbo frequency (or base frequency if nothing
         else is reported by MSRs)

3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
   * The invariant schedutil's formula has no feedback loop and reacts faster
     to utilization changes

4. KNOWN LIMITATIONS
   * In some cases tasks can't reach max util despite how hard they try

5. PERFORMANCE TESTING
   5.1 MACHINES
       * Skylake, Broadwell, Haswell
   5.2 SETUP
       * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12
         active cores turbo w/ invariant schedutil, and intel_pstate/powersave
   5.3 BENCHMARK RESULTS
       5.3.1 NEUTRAL BENCHMARKS
             * NAS Parallel Benchmark (HPC), hackbench
       5.3.2 NON-NEUTRAL BENCHMARKS
             * tbench (10-30% better), kernbench (10-15% better),
               shell-intensive-scripts (30-50% better)
             * no regressions
       5.3.3 SELECTION OF DETAILED RESULTS
       5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
             * dbench (5% worse on one machine), kernbench (3% worse),
               tbench (5-10% better), shell-intensive-scripts (10-40% better)

6. MICROARCH'ES ADDRESSED HERE
   * Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum
     etc have different MSRs semantic for querying turbo levels)

7. REFERENCES
   * MMTests performance testing framework, github.com/gormanm/mmtests

+-------------------------------------------------------------------------+
| 1. FREQUENCY INVARIANCE: MOTIVATION
+-------------------------------------------------------------------------+

For example; suppose a CPU has two frequencies: 500 and 1000 Mhz. When
running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
false impression this CPU is almost at capacity, even though it can go
faster [*]. In a nutshell, without frequency scale-invariance tasks look
larger just because the CPU is running slower.

[*] (footnote: this assumes a linear frequency/performance relation; which
everybody knows to be false, but given realities its the best approximation
we can make.)

+-------------------------------------------------------------------------+
| 2. PECULIARITIES OF X86
+-------------------------------------------------------------------------+

Accounting for frequency changes in PELT signals requires the computation of
the ratio freq_curr / freq_max. On x86 neither of those terms is readily
available.

2.1 CURRENT FREQUENCY
====================

Since modern x86 has hardware control over the actual frequency we run
at (because amongst other things, Turbo-Mode), we cannot simply use
the frequency as requested through cpufreq.

Instead we use the APERF/MPERF MSRs to compute the effective frequency
over the recent past. Also, because reading MSRs is expensive, don't
do so every time we need the value, but amortize the cost by doing it
every tick.

2.2 MAX FREQUENCY
=================

Obtaining freq_max is also non-trivial because at any time the hardware can
provide a frequency boost to a selected subset of cores if the package has
enough power to spare (eg: Turbo Boost). This means that the maximum frequency
available to a given core changes with time.

The approach taken in this change is to arbitrarily set freq_max to a constant
value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most
microarchitectures, after evaluating the following candidates:

    * 1-core (1C) turbo frequency (the fastest turbo state available)
    * around base frequency (a.k.a. max P-state)
    * something in between, such as 4C turbo

To interpret these options, consider that this is the denominator in
freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
util_avg and load_avg. A large denominator will undershoot (util_avg looks a
bit smaller than it really is), viceversa with a smaller denominator PELT
signals will tend to overshoot. Given that PELT drives frequency selection
in the schedutil governor, we will have:

    freq_max set to     | effect on DVFS
    --------------------+------------------
    1C turbo            | power efficiency (lower freq choices)
    base freq           | performance (higher util_avg, higher freq requests)
    4C turbo            | a bit of both

4C turbo proves to be a good compromise in a number of benchmarks (see
below). Note that when the function intel_set_cpu_max_freq() fails to query
the various MSRs for the 4C turbo value, the variable arch_max_freq retains
its default value of SCHED_CAPACITY_SCALE (1024) that corresponds to setting
freq_max to base frequency wrt the table above.

+-------------------------------------------------------------------------+
| 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
+-------------------------------------------------------------------------+

Once an architecture implements a frequency scale-invariant utilization (the
PELT signal util_avg), schedutil switches its frequency selection formula from

    freq_next = 1.25 * freq_curr * util            [non-invariant util signal]

to

    freq_next = 1.25 * freq_max * util             [invariant util signal]

where, in the second formula, freq_max is set to the 1C turbo frequency (max
turbo). The advantage of the second formula, whose usage we unlock with this
patch, is that freq_next doesn't depend on the current frequency in an
iterative fashion, but can jump to any frequency in a single update. This
absence of feedback in the formula makes it quicker to react to utilization
changes and more robust against pathological instabilities.

Compare it to the update formula of intel_pstate/powersave:

    freq_next = 1.25 * freq_max * Busy%

where again freq_max is 1C turbo and Busy% is the percentage of time not spent
idling (calculated with delta_MPERF / delta_TSC); essentially the same as
invariant schedutil, and largely responsible for intel_pstate/powersave good
reputation. The non-invariant schedutil formula is derived from the invariant
one by approximating util_inv with util_raw * freq_curr / freq_max, but this
has limitations.

Testing shows improved performances due to better frequency selections when
the machine is lightly loaded, and essentially no change in behaviour at
saturation / overutilization.

+-------------------------------------------------------------------------+
| 4. KNOWN LIMITATIONS
+-------------------------------------------------------------------------+

It's been shown that it is possible to create pathological scenarios where a
CPU-bound task cannot reach max utilization, if the normalizing factor
freq_max is fixed to a constant value (see [Lelli-2018]).

If freq_max is set to 4C turbo as we do here, one needs to peg at least 5
cores in a package doing some busywork, and observe that none of those task
will ever reach max util (1024) because they're all running at less than the
4C turbo frequency.

While this concern still applies, we believe the performance benefit of
frequency scale-invariant PELT signals outweights the cost of this limitation.

[Lelli-2018]
https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/

+-------------------------------------------------------------------------+
| 5. PERFORMANCE TESTING
+-------------------------------------------------------------------------+

5.1 MACHINES
============

We tested the patch on three machines, with Skylake, Broadwell and Haswell
CPUs. The details are below, together with the available turbo ratios as
reported by the appropriate MSRs.

* 8x-SKYLAKE-UMA:
  Single socket E3-1240 v5, Skylake 4 cores/8 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC    800 |********
    BASE    3500 |***********************************
    4C      3700 |*************************************
    3C      3800 |**************************************
    2C      3900 |***************************************
    1C      3900 |***************************************

* 80x-BROADWELL-NUMA:
  Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC   1200 |************
    BASE    2200 |**********************
    8C      2900 |*****************************
    7C      3000 |******************************
    6C      3100 |*******************************
    5C      3200 |********************************
    4C      3300 |*********************************
    3C      3400 |**********************************
    2C      3600 |************************************
    1C      3600 |************************************

* 48x-HASWELL-NUMA
  Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC   1200 |************
    BASE    2300 |***********************
    12C     2600 |**************************
    11C     2600 |**************************
    10C     2600 |**************************
    9C      2600 |**************************
    8C      2600 |**************************
    7C      2600 |**************************
    6C      2600 |**************************
    5C      2700 |***************************
    4C      2800 |****************************
    3C      2900 |*****************************
    2C      3100 |*******************************
    1C      3100 |*******************************

5.2 SETUP
=========

* The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate
  driver in passive mode.
* The rationale for choosing the various freq_max values to test have been to
  try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical
  on all machines), plus one more value closer to base_freq but still in the
  turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA).
* In addition we've run all tests with intel_pstate/powersave for comparison.
* The filesystem is always XFS, the userspace is openSUSE Leap 15.1.
* 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
  with active intel_pstate on this machine use that.

This gives, in terms of combinations tested on each machine:

* 8x-SKYLAKE-UMA
  * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
  * intel_pstate active + powersave + HWP
  * invariant schedutil, freq_max = 1C turbo
  * invariant schedutil, freq_max = 3C turbo
  * invariant schedutil, freq_max = 4C turbo

* both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
  * [same as 8x-SKYLAKE-UMA, but no HWP capable]
  * invariant schedutil, freq_max = 8C turbo
  * (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo")

5.3 BENCHMARK RESULTS
=====================

5.3.1 NEUTRAL BENCHMARKS
------------------------

Tests that didn't show any measurable difference in performance on any of the
test machines between non-invariant schedutil and our patch are:

* NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
  computational kernel
* flexible I/O (FIO)
* hackbench (using threads or processes, and using pipes or sockets)

5.3.2 NON-NEUTRAL BENCHMARKS
----------------------------

What follow are summary tables where each benchmark result is given a score.

* A tilde (~) means a neutral result, i.e. no difference from baseline.
* Scores are computed with the ratio result_new / result_baseline, so a tilde
  means a score of 1.00.
* The results in the score ratio are the geometric means of results running
  the benchmark with different parameters (eg: for kernbench: using 1, 2, 4,
  ... number of processes; for pgbench: varying the number of clients, and so
  on).
* The first three tables show higher-is-better kind of tests (i.e. measured in
  operations/second), the subsequent three show lower-is-better kind of tests
  (i.e. the workload is fixed and we measure elapsed time, think kernbench).
* "gitsource" is a name we made up for the test consisting in running the
  entire unit tests suite of the Git SCM and measuring how long it takes. We
  take it as a typical example of shell-intensive serialized workload.
* In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other
  columns show invariant schedutil for different values of freq_max. 4C turbo
  is circled as it's the value we've chosen for the final implementation.

80x-BROADWELL-NUMA (comparison ratio; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
pgbench-ro           1.14   ~      ~     | 1.11 |  1.14
pgbench-rw           ~      ~      ~     | ~    |  ~
netperf-udp          1.06   ~      1.06  | 1.05 |  1.07
netperf-tcp          ~      1.03   ~     | 1.01 |  1.02
tbench4              1.57   1.18   1.22  | 1.30 |  1.56
                                         +------+

8x-SKYLAKE-UMA (comparison ratio; higher is better)
                                         +------+
             I_PSTATE/HWP   1C     3C    | 4C   |
pgbench-ro           ~      ~      ~     | ~    |
pgbench-rw           ~      ~      ~     | ~    |
netperf-udp          ~      ~      ~     | ~    |
netperf-tcp          ~      ~      ~     | ~    |
tbench4              1.30   1.14   1.14  | 1.16 |
                                         +------+

48x-HASWELL-NUMA (comparison ratio; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  12C
pgbench-ro           1.15   ~      ~     | 1.06 |  1.16
pgbench-rw           ~      ~      ~     | ~    |  ~
netperf-udp          1.05   0.97   1.04  | 1.04 |  1.02
netperf-tcp          0.96   1.01   1.01  | 1.01 |  1.01
tbench4              1.50   1.05   1.13  | 1.13 |  1.25
                                         +------+

In the table above we see that active intel_pstate is slightly better than our
4C-turbo patch (both in reference to the baseline non-invariant schedutil) on
read-only pgbench and much better on tbench. Both cases are notable in which
it shows that lowering our freq_max (to 8C-turbo and 12C-turbo on
80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant
schedutil to get closer.

If we ignore active intel_pstate and focus on the comparison with baseline
alone, there are several instances of double-digit performance improvement.

80x-BROADWELL-NUMA (comparison ratio; lower is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
dbench4              1.23   0.95   0.95  | 0.95 |  0.95
kernbench            0.93   0.83   0.83  | 0.83 |  0.82
gitsource            0.98   0.49   0.49  | 0.49 |  0.48
                                         +------+

8x-SKYLAKE-UMA (comparison ratio; lower is better)
                                         +------+
             I_PSTATE/HWP   1C     3C    | 4C   |
dbench4              ~      ~      ~     | ~    |
kernbench            ~      ~      ~     | ~    |
gitsource            0.92   0.55   0.55  | 0.55 |
                                         +------+

48x-HASWELL-NUMA (comparison ratio; lower is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
dbench4              ~      ~      ~     | ~    |  ~
kernbench            0.94   0.90   0.89  | 0.90 |  0.90
gitsource            0.97   0.69   0.69  | 0.69 |  0.69
                                         +------+

dbench is not very remarkable here, unless we notice how poorly active
intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus
non-invariant schedutil. We repeated that run getting consistent results. Out
of scope for the patch at hand, but deserving future investigation. Other than
that, we previously ran this campaign with Linux v5.0 and saw the patch doing
better on dbench a the time. We haven't checked closely and can only speculate
at this point.

On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in
the detailed tables that the gains concentrate on low process counts (lightly
loaded machines).

The test we call "gitsource" (running the git unit test suite, a long-running
single-threaded shell script) appears rather spectacular in this table (gains
of 30-50% depending on the machine). It is to be noted, however, that
gitsource has no adjustable parameters (such as the number of jobs in
kernbench, which we average over in order to get a single-number summary
score) and is exactly the kind of low-parallelism workload that benefits the
most from this patch. When looking at the detailed tables of kernbench or
tbench4, at low process or client counts one can see similar numbers.

5.3.3 SELECTION OF DETAILED RESULTS
-----------------------------------

Machine            : 48x-HASWELL-NUMA
Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                   5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        126.73  +- 0.31% (        )      315.91  +- 0.66% ( 149.28%)      125.03  +- 0.76% (  -1.34%)
Hmean  2        258.04  +- 0.62% (        )      614.16  +- 0.51% ( 138.01%)      269.58  +- 1.45% (   4.47%)
Hmean  4        514.30  +- 0.67% (        )     1146.58  +- 0.54% ( 122.94%)      533.84  +- 1.99% (   3.80%)
Hmean  8       1111.38  +- 2.52% (        )     2159.78  +- 0.38% (  94.33%)     1359.92  +- 1.56% (  22.36%)
Hmean  16      2286.47  +- 1.36% (        )     3338.29  +- 0.21% (  46.00%)     2720.20  +- 0.52% (  18.97%)
Hmean  32      4704.84  +- 0.35% (        )     4759.03  +- 0.43% (   1.15%)     4774.48  +- 0.30% (   1.48%)
Hmean  64      7578.04  +- 0.27% (        )     7533.70  +- 0.43% (  -0.59%)     7462.17  +- 0.65% (  -1.53%)
Hmean  128     6998.52  +- 0.16% (        )     6987.59  +- 0.12% (  -0.16%)     6909.17  +- 0.14% (  -1.28%)
Hmean  192     6901.35  +- 0.25% (        )     6913.16  +- 0.10% (   0.17%)     6855.47  +- 0.21% (  -0.66%)

                             5.2.0 3C-turbo                   5.2.0 4C-turbo                  5.2.0 12C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        128.43  +- 0.28% (   1.34%)      130.64  +- 3.81% (   3.09%)      153.71  +- 5.89% (  21.30%)
Hmean  2        311.70  +- 6.15% (  20.79%)      281.66  +- 3.40% (   9.15%)      305.08  +- 5.70% (  18.23%)
Hmean  4        641.98  +- 2.32% (  24.83%)      623.88  +- 5.28% (  21.31%)      906.84  +- 4.65% (  76.32%)
Hmean  8       1633.31  +- 1.56% (  46.96%)     1714.16  +- 0.93% (  54.24%)     2095.74  +- 0.47% (  88.57%)
Hmean  16      3047.24  +- 0.42% (  33.27%)     3155.02  +- 0.30% (  37.99%)     3634.58  +- 0.15% (  58.96%)
Hmean  32      4734.31  +- 0.60% (   0.63%)     4804.38  +- 0.23% (   2.12%)     4674.62  +- 0.27% (  -0.64%)
Hmean  64      7699.74  +- 0.35% (   1.61%)     7499.72  +- 0.34% (  -1.03%)     7659.03  +- 0.25% (   1.07%)
Hmean  128     6935.18  +- 0.15% (  -0.91%)     6942.54  +- 0.10% (  -0.80%)     7004.85  +- 0.12% (   0.09%)
Hmean  192     6901.62  +- 0.12% (   0.00%)     6856.93  +- 0.10% (  -0.64%)     6978.74  +- 0.10% (   1.12%)

This is one of the cases where the patch still can't surpass active
intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are
visible up to 16 clients and the saturated scenario is the same as baseline.

The scores in the summary table from the previous sections are ratios of
geometric means of the results over different clients, as seen in this table.

Machine            : 80x-BROADWELL-NUMA
Benchmark          : kernbench (kernel compilation)
Varying parameter  : number of jobs
Unit               : seconds (lower is better)

                   5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        379.68  +- 0.06% (        )      330.20  +- 0.43% (  13.03%)      285.93  +- 0.07% (  24.69%)
Amean  4        200.15  +- 0.24% (        )      175.89  +- 0.22% (  12.12%)      153.78  +- 0.25% (  23.17%)
Amean  8        106.20  +- 0.31% (        )       95.54  +- 0.23% (  10.03%)       86.74  +- 0.10% (  18.32%)
Amean  16        56.96  +- 1.31% (        )       53.25  +- 1.22% (   6.50%)       48.34  +- 1.73% (  15.13%)
Amean  32        34.80  +- 2.46% (        )       33.81  +- 0.77% (   2.83%)       30.28  +- 1.59% (  12.99%)
Amean  64        26.11  +- 1.63% (        )       25.04  +- 1.07% (   4.10%)       22.41  +- 2.37% (  14.16%)
Amean  128       24.80  +- 1.36% (        )       23.57  +- 1.23% (   4.93%)       21.44  +- 1.37% (  13.55%)
Amean  160       24.85  +- 0.56% (        )       23.85  +- 1.17% (   4.06%)       21.25  +- 1.12% (  14.49%)

                             5.2.0 3C-turbo                   5.2.0 4C-turbo                   5.2.0 8C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        284.08  +- 0.13% (  25.18%)      283.96  +- 0.51% (  25.21%)      285.05  +- 0.21% (  24.92%)
Amean  4        153.18  +- 0.22% (  23.47%)      154.70  +- 1.64% (  22.71%)      153.64  +- 0.30% (  23.24%)
Amean  8         87.06  +- 0.28% (  18.02%)       86.77  +- 0.46% (  18.29%)       86.78  +- 0.22% (  18.28%)
Amean  16        48.03  +- 0.93% (  15.68%)       47.75  +- 1.99% (  16.17%)       47.52  +- 1.61% (  16.57%)
Amean  32        30.23  +- 1.20% (  13.14%)       30.08  +- 1.67% (  13.57%)       30.07  +- 1.67% (  13.60%)
Amean  64        22.59  +- 2.02% (  13.50%)       22.63  +- 0.81% (  13.32%)       22.42  +- 0.76% (  14.12%)
Amean  128       21.37  +- 0.67% (  13.82%)       21.31  +- 1.15% (  14.07%)       21.17  +- 1.93% (  14.63%)
Amean  160       21.68  +- 0.57% (  12.76%)       21.18  +- 1.74% (  14.77%)       21.22  +- 1.00% (  14.61%)

The patch outperform active intel_pstate (and baseline) by a considerable
margin; the summary table from the previous section says 4C turbo and active
intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
noticeable difference with regard to the value of freq_max.

Machine            : 8x-SKYLAKE-UMA
Benchmark          : gitsource (time to run the git unit test suite)
Varying parameter  : none
Unit               : seconds (lower is better)

                            5.2.0 vanilla           5.2.0 intel_pstate/hwp         5.2.0 1C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean         858.85  +- 1.16% (        )      791.94  +- 0.21% (   7.79%)      474.95 (  44.70%)

                           5.2.0 3C-turbo                   5.2.0 4C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean         475.26  +- 0.20% (  44.66%)      474.34  +- 0.13% (  44.77%)

In this test, which is of interest as representing shell-intensive
(i.e. fork-intensive) serialized workloads, invariant schedutil outperforms
intel_pstate/powersave by a whopping 40% margin.

5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
---------------------------------------------

The following table shows average power consumption in watt for each
benchmark. Data comes from turbostat (package average), which in turn is read
from the RAPL interface on CPUs. We know the patch affects CPU frequencies so
it's reasonable to ignore other power consumers (such as memory or I/O). Also,
we don't have a power meter available in the lab so RAPL is the best we have.

turbostat sampled average power every 10 seconds for the entire duration of
each benchmark. We took all those values and averaged them (i.e. with don't
have detail on a per-parameter granularity, only on whole benchmarks).

80x-BROADWELL-NUMA (power consumption, watts)
                                                    +--------+
               BASELINE I_PSTATE       1C       3C  |     4C |      8C
pgbench-ro       130.01   142.77   131.11   132.45  | 134.65 |  136.84
pgbench-rw        68.30    60.83    71.45    71.70  |  71.65 |   72.54
dbench4           90.25    59.06   101.43    99.89  | 101.10 |  102.94
netperf-udp       65.70    69.81    66.02    68.03  |  68.27 |   68.95
netperf-tcp       88.08    87.96    88.97    88.89  |  88.85 |   88.20
tbench4          142.32   176.73   153.02   163.91  | 165.58 |  176.07
kernbench         92.94   101.95   114.91   115.47  | 115.52 |  115.10
gitsource         40.92    41.87    75.14    75.20  |  75.40 |   75.70
                                                    +--------+
8x-SKYLAKE-UMA (power consumption, watts)
                                                    +--------+
              BASELINE I_PSTATE/HWP    1C       3C  |     4C |
pgbench-ro        46.49    46.68    46.56    46.59  |  46.52 |
pgbench-rw        29.34    31.38    30.98    31.00  |  31.00 |
dbench4           27.28    27.37    27.49    27.41  |  27.38 |
netperf-udp       22.33    22.41    22.36    22.35  |  22.36 |
netperf-tcp       27.29    27.29    27.30    27.31  |  27.33 |
tbench4           41.13    45.61    43.10    43.33  |  43.56 |
kernbench         42.56    42.63    43.01    43.01  |  43.01 |
gitsource         13.32    13.69    17.33    17.30  |  17.35 |
                                                    +--------+
48x-HASWELL-NUMA (power consumption, watts)
                                                    +--------+
               BASELINE I_PSTATE       1C       3C  |     4C |     12C
pgbench-ro       128.84   136.04   129.87   132.43  | 132.30 |  134.86
pgbench-rw        37.68    37.92    37.17    37.74  |  37.73 |   37.31
dbench4           28.56    28.73    28.60    28.73  |  28.70 |   28.79
netperf-udp       56.70    60.44    56.79    57.42  |  57.54 |   57.52
netperf-tcp       75.49    75.27    75.87    76.02  |  76.01 |   75.95
tbench4          115.44   139.51   119.53   123.07  | 123.97 |  130.22
kernbench         83.23    91.55    95.58    95.69  |  95.72 |   96.04
gitsource         36.79    36.99    39.99    40.34  |  40.35 |   40.23
                                                    +--------+

A lower power consumption isn't necessarily better, it depends on what is done
with that energy. Here are tables with the ratio of performance-per-watt on
each machine and benchmark. Higher is always better; a tilde (~) means a
neutral ratio (i.e. 1.00).

80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better)
                                     +------+
             I_PSTATE     1C     3C  |   4C |    8C
pgbench-ro       1.04   1.06   0.94  | 1.07 |  1.08
pgbench-rw       1.10   0.97   0.96  | 0.96 |  0.97
dbench4          1.24   0.94   0.95  | 0.94 |  0.92
netperf-udp      ~      1.02   1.02  | ~    |  1.02
netperf-tcp      ~      1.02   ~     | ~    |  1.02
tbench4          1.26   1.10   1.06  | 1.12 |  1.26
kernbench        0.98   0.97   0.97  | 0.97 |  0.98
gitsource        ~      1.11   1.11  | 1.11 |  1.13
                                     +------+

8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better)
                                     +------+
         I_PSTATE/HWP     1C     3C  |   4C |
pgbench-ro       ~      ~      ~     | ~    |
pgbench-rw       0.95   0.97   0.96  | 0.96 |
dbench4          ~      ~      ~     | ~    |
netperf-udp      ~      ~      ~     | ~    |
netperf-tcp      ~      ~      ~     | ~    |
tbench4          1.17   1.09   1.08  | 1.10 |
kernbench        ~      ~      ~     | ~    |
gitsource        1.06   1.40   1.40  | 1.40 |
                                     +------+

48x-HASWELL-NUMA  (performance-per-watt ratios; higher is better)
                                     +------+
             I_PSTATE     1C     3C  |   4C |   12C
pgbench-ro       1.09   ~      1.09  | 1.03 |  1.11
pgbench-rw       ~      0.86   ~     | ~    |  0.86
dbench4          ~      1.02   1.02  | 1.02 |  ~
netperf-udp      ~      0.97   1.03  | 1.02 |  ~
netperf-tcp      0.96   ~      ~     | ~    |  ~
tbench4          1.24   ~      1.06  | 1.05 |  1.11
kernbench        0.97   0.97   0.98  | 0.97 |  0.96
gitsource        1.03   1.33   1.32  | 1.32 |  1.33
                                     +------+

These results are overall pleasing: in plenty of cases we observe
performance-per-watt improvements. The few regressions (read/write pgbench and
dbench on the Broadwell machine) are of small magnitude. kernbench loses a few
percentage points (it has a 10-15% performance improvement, but apparently the
increase in power consumption is larger than that). tbench4 and gitsource, which
benefit the most from the patch, keep a positive score in this table which is
a welcome surprise; that suggests that in those particular workloads the
non-invariant schedutil (and active intel_pstate, too) makes some rather
suboptimal frequency selections.

+-------------------------------------------------------------------------+
| 6. MICROARCH'ES ADDRESSED HERE
+-------------------------------------------------------------------------+

The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and
MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies
respectively. This excludes the recent Xeon Scalable Performance processors
line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently.

Subsequent patches will address:

* Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus
* Xeon Phi (Knights Landing, Knights Mill)
* Atom Silvermont

+-------------------------------------------------------------------------+
| 7. REFERENCES
+-------------------------------------------------------------------------+

Tests have been run with the help of the MMTests performance testing
framework, see github.com/gormanm/mmtests. The configuration file names for
the benchmark used are:

    db-pgbench-timed-ro-small-xfs
    db-pgbench-timed-rw-small-xfs
    io-dbench4-async-xfs
    network-netperf-unbound
    network-tbench
    scheduler-unbound
    workload-kerndevel-xfs
    workload-shellscripts-xfs
    hpc-nas-c-class-mpi-full-xfs
    hpc-nas-c-class-omp-full

All those benchmarks are generally available on the web:

pgbench: https://www.postgresql.org/docs/10/pgbench.html
netperf: https://hewlettpackard.github.io/netperf/
dbench/tbench: https://dbench.samba.org/
gitsource: git unit test suite, github.com/git/git
NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Acked-by: Doug Smythies <dsmythies@telus.net>
---
 arch/x86/include/asm/topology.h |  23 ++++++
 arch/x86/kernel/smpboot.c       | 176 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c             |   1 +
 kernel/sched/sched.h            |   7 ++
 4 files changed, 206 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4b14d2318251..9b3aca463c8f 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -193,4 +193,27 @@ static inline void sched_clear_itmt_support(void)
 }
 #endif /* CONFIG_SCHED_MC_PRIO */
 
+#ifdef CONFIG_SMP
+#include <asm/cpufeature.h>
+
+DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)
+
+DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
+
+static inline long arch_scale_freq_capacity(int cpu)
+{
+	if (arch_scale_freq_invariant())
+		return per_cpu(arch_cpu_freq, cpu);
+
+	return 1024 /* SCHED_CAPACITY_SCALE */;
+}
+#define arch_scale_freq_capacity arch_scale_freq_capacity
+
+extern void arch_scale_freq_tick(void);
+#define arch_scale_freq_tick arch_scale_freq_tick
+
+#endif
+
 #endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 69881b2d446c..814d7900779d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -147,6 +147,8 @@ static inline void smpboot_restore_warm_reset_vector(void)
 	*((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
 }
 
+static void set_cpu_max_freq(void);
+
 /*
  * Report back to the Boot Processor during boot time or to the caller processor
  * during CPU online.
@@ -183,6 +185,8 @@ static void smp_callin(void)
 	 */
 	set_cpu_sibling_map(raw_smp_processor_id());
 
+	set_cpu_max_freq();
+
 	/*
 	 * Get our bogomips.
 	 * Update loops_per_jiffy in cpu_data. Previous call to
@@ -1337,7 +1341,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 	set_sched_topology(x86_topology);
 
 	set_cpu_sibling_map(0);
-
+	set_cpu_max_freq();
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {
@@ -1764,3 +1768,173 @@ void native_play_dead(void)
 }
 
 #endif
+
+/*
+ * APERF/MPERF frequency ratio computation.
+ *
+ * The scheduler wants to do frequency invariant accounting and needs a <1
+ * ratio to account for the 'current' frequency, corresponding to
+ * freq_curr / freq_max.
+ *
+ * Since the frequency freq_curr on x86 is controlled by micro-controller and
+ * our P-state setting is little more than a request/hint, we need to observe
+ * the effective frequency 'BusyMHz', i.e. the average frequency over a time
+ * interval after discarding idle time. This is given by:
+ *
+ *   BusyMHz = delta_APERF / delta_MPERF * freq_base
+ *
+ * where freq_base is the max non-turbo P-state.
+ *
+ * The freq_max term has to be set to a somewhat arbitrary value, because we
+ * can't know which turbo states will be available at a given point in time:
+ * it all depends on the thermal headroom of the entire package. We set it to
+ * the turbo level with 4 cores active.
+ *
+ * Benchmarks show that's a good compromise between the 1C turbo ratio
+ * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
+ * which would ignore the entire turbo range (a conspicuous part, making
+ * freq_curr/freq_max always maxed out).
+ *
+ * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
+ * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
+ */
+
+DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+static DEFINE_PER_CPU(u64, arch_prev_aperf);
+static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static u64 arch_max_freq = SCHED_CAPACITY_SCALE;
+
+static bool turbo_disabled(void)
+{
+	u64 misc_en;
+	int err;
+
+	err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
+	if (err)
+		return false;
+
+	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
+}
+
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
+
+#define ICPU(model) \
+	{X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0}
+
+static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
+	ICPU(INTEL_FAM6_XEON_PHI_KNL),
+	ICPU(INTEL_FAM6_XEON_PHI_KNM),
+	{}
+};
+
+static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
+	ICPU(INTEL_FAM6_SKYLAKE_X),
+	{}
+};
+
+static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT),
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT_D),
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS),
+	{}
+};
+
+static void core_set_cpu_max_freq(void)
+{
+	u64 ratio, turbo_ratio;
+	int err;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio);
+	if (err)
+		return;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio);
+	if (err)
+		return;
+
+	ratio = (ratio >> 8) & 0xFF;                /* max P state ratio */
+	turbo_ratio = (turbo_ratio >> 24) & 0xFF;   /* 4C turbo ratio */
+
+	arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio);
+}
+
+static void intel_set_cpu_max_freq(void)
+{
+	/*
+	 * TODO: add support for:
+	 *
+	 * - Xeon Gold/Platinum
+	 * - Xeon Phi (KNM, KNL)
+	 * - Atom Goldmont
+	 * - Atom Silvermont
+	 *
+	 * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE
+	 */
+
+	static_branch_enable(&arch_scale_freq_key);
+
+	if (turbo_disabled() ||
+		x86_match_cpu(has_skx_turbo_ratio_limits) ||
+		x86_match_cpu(has_knl_turbo_ratio_limits) ||
+		x86_match_cpu(has_glm_turbo_ratio_limits))
+		return;
+
+	core_set_cpu_max_freq();
+}
+
+static void init_scale_freq(void *arg)
+{
+	u64 aperf, mperf;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+}
+
+static void set_cpu_max_freq(void)
+{
+	if (smp_processor_id() != 0 || !boot_cpu_has(X86_FEATURE_APERFMPERF))
+		return;
+
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		intel_set_cpu_max_freq();
+
+	init_scale_freq(NULL);
+}
+
+DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
+
+void arch_scale_freq_tick(void)
+{
+	u64 freq;
+	u64 aperf, mperf;
+	u64 acnt, mcnt;
+
+	if (!arch_scale_freq_invariant())
+		return;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	acnt = aperf - this_cpu_read(arch_prev_aperf);
+	mcnt = mperf - this_cpu_read(arch_prev_mperf);
+	if (!mcnt)
+		return;
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+
+	acnt <<= 2*SCHED_CAPACITY_SHIFT;
+	mcnt *= arch_max_freq;
+
+	freq = div64_u64(acnt, mcnt);
+
+	if (freq > SCHED_CAPACITY_SCALE)
+		freq = SCHED_CAPACITY_SCALE;
+
+	this_cpu_write(arch_cpu_freq, freq);
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f2eb3629070..1d0f5df8020e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3593,6 +3593,7 @@ void scheduler_tick(void)
 	struct task_struct *curr = rq->curr;
 	struct rq_flags rf;
 
+	arch_scale_freq_tick();
 	sched_clock_tick();
 
 	rq_lock(rq, &rf);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c8870c5bd7df..4c4f5f197e10 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1977,6 +1977,13 @@ static inline int hrtick_enabled(struct rq *rq)
 
 #endif /* CONFIG_SCHED_HRTICK */
 
+#ifndef arch_scale_freq_tick
+static __always_inline
+void arch_scale_freq_tick(void)
+{
+}
+#endif
+
 #ifndef arch_scale_freq_capacity
 static __always_inline
 unsigned long arch_scale_freq_capacity(int cpu)
-- 
2.16.4


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 2/6] x86,sched: Add support for frequency invariance on SKYLAKE_X
  2019-11-13 12:46 [PATCH v4 0/6] Add support for frequency invariance for (some) x86 Giovanni Gherdovich
  2019-11-13 12:46 ` [PATCH v4 1/6] x86,sched: Add support for frequency invariance Giovanni Gherdovich
@ 2019-11-13 12:46 ` Giovanni Gherdovich
  2019-11-13 12:46 ` [PATCH v4 3/6] x86,sched: Add support for frequency invariance on XEON_PHI_KNL/KNM Giovanni Gherdovich
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Giovanni Gherdovich @ 2019-11-13 12:46 UTC (permalink / raw)
  To: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann, Doug Smythies,
	Giovanni Gherdovich

The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On SKYLAKE_X CPUs set freq_max to the highest frequency that can
be sustained by a group of at least 4 cores.

From the changelog of commit 31e07522be56 ("tools/power turbostat: fix
decoding for GLM, DNV, SKX turbo-ratio limits"):

>   Newer processors do not hard-code the the number of cpus in each bin
>   to {1, 2, 3, 4, 5, 6, 7, 8}  Rather, they can specify any number
>   of CPUS in each of the 8 bins:
>
>   eg.
>
>   ...
>   37 * 100.0 = 3600.0 MHz max turbo 4 active cores
>   38 * 100.0 = 3700.0 MHz max turbo 3 active cores
>   39 * 100.0 = 3800.0 MHz max turbo 2 active cores
>   39 * 100.0 = 3900.0 MHz max turbo 1 active cores
>
>   could now look something like this:
>
>   ...
>   37 * 100.0 = 3600.0 MHz max turbo 16 active cores
>   38 * 100.0 = 3700.0 MHz max turbo 8 active cores
>   39 * 100.0 = 3800.0 MHz max turbo 4 active cores
>   39 * 100.0 = 3900.0 MHz max turbo 2 active cores

This encoding of turbo levels applies to both SKYLAKE_X and GOLDMONT/GOLDMONT_D,
but we treat these two classes in separate commits because their freq_max
values need to be different. For SKX we prefer a lower freq_max in the ratio
freq_curr/freq_max, allowing load and utilization to overshoot and the
schedutil governor to be more performance-oriented. Models from the Atom
series (such as GOLDMONT*) are handled in a forthcoming commit as they have to
favor power-efficiency over performance.

Results from a performance evaluation follow.

1. TEST SETUP
2. NEUTRAL BENCHMARKS
3. NON-NEUTRAL BENCHMARKS
4. DETAILED TABLES

1. TEST SETUP
-------------

Test machine:

CPU Model   : Intel Xeon Platinum 8260L CPU @ 2.40GHz (a.k.a. Cascade Lake)
Fam/Mod/Ste : 6:85:6
Topology    : 2 sockets, 24 cores / 48 threads each socket
Memory      : 192G
Storage     : SSD, XFS filesystem

Max EFFICiency, BASE frequency and available turbo levels (MHz):

    EFFIC   1000 |**********
    BASE    2400 |************************
    24C     3100 |*******************************
    20C     3300 |*********************************
    16C     3600 |************************************
    12C     3600 |************************************
    8C      3600 |************************************
    4C      3700 |*************************************
    2C      3900 |***************************************

Tested kernels:

Baseline      : v5.2,              intel_pstate passive,  schedutil
Comparison #1 : v5.2,              intel_pstate active ,  powersave+HWP
Comparison #2 : v5.2, this patch,  intel_pstate passive,  schedutil

2. NEUTRAL BENCHMARKS
---------------------

* pgbench read/write
* NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
* hackbench
* netperf

3. NON-NEUTRAL BENCHMARKS
-------------------------

comparison ratio with baseline; 1.00 means neutral, higher is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    pgbench read-only     1.10             ~
    tbench                1.82          1.14

comparison ratio with baseline; 1.00 means neutral, lower is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench                   ~          0.97
    kernbench             0.88          0.78
    gitsource[*]             ~          0.46

[*] "gitsource" consists in running git's unit tests
tilde (~) means 1.00, ie result identical to baseline

Performance per watt:

performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:

		      I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench                0.92          0.91
    tbench                1.26          1.04
    kernbench             0.95          0.96
    gitsource             1.03          1.30

Similarly to earlier Xeons, measurable performance gains over non-invariant
schedutil are observed on dbench, tbench, kernel compilation and running the
git unit tests suite. Looking at the detailed tables show that the patch
scores the largest difference when the machine is lightly loaded. Power
efficiency suffers lightly on kernbench and a bit more on dbench, but largely
improves on gitsource (which also runs considerably faster). For reference, we
also report results using active intel_pstate with powersave and HWP; the
largest gap between non-invariant schedutil and intel_pstate+powersave is
still tbench, which runs 82% better and with 26% improved efficiency on the
latter configuration -- this divide isn't closed yet by frequency-invariant
schedutil.

4. DETAILED TABLES
------------------

Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                     5.2.0 vanilla (BASELINE)            5.2.0 intel_pstate/HWP                    5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean   1         183.56  +- 0.21% (        )       516.12  +- 0.57% ( 181.18%)       185.59  +- 0.59% (   1.11%)
Hmean   2         365.75  +- 0.25% (        )      1015.14  +- 0.33% ( 177.55%)       402.59  +- 4.48% (  10.07%)
Hmean   4         720.99  +- 0.44% (        )      1951.75  +- 0.28% ( 170.70%)       738.39  +- 1.72% (   2.41%)
Hmean   8        1449.93  +- 0.34% (        )      3830.56  +- 0.24% ( 164.19%)      1750.36  +- 4.65% (  20.72%)
Hmean   16       2874.26  +- 0.57% (        )      7381.62  +- 0.53% ( 156.82%)      4348.35  +- 2.22% (  51.29%)
Hmean   32       6116.17  +- 5.10% (        )     13013.05  +- 0.08% ( 112.76%)      8980.35  +- 0.66% (  46.83%)
Hmean   64      14485.04  +- 3.46% (        )     17835.12  +- 0.35% (  23.13%)     16540.73  +- 0.51% (  14.19%)
Hmean   128     30779.16  +- 3.20% (        )     32796.94  +- 2.13% (   6.56%)     31512.58  +- 0.20% (   2.38%)
Hmean   256     34664.66  +- 0.81% (        )     34604.67  +- 0.46% (  -0.17%)     34943.70  +- 0.25% (   0.80%)
Hmean   384     33957.51  +- 0.11% (        )     34091.50  +- 0.14% (   0.39%)     33921.41  +- 0.09% (  -0.11%)

Benchmark          : kernbench (kernel compilation)
Varying parameter  : number of jobs
Unit               : seconds (lower is better)

                    5.2.0 vanilla (BASELINE)             5.2.0 intel_pstate/HWP                     5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean   2        332.94  +- 0.40% (        )        260.16  +- 0.45% (  21.86%)        233.56  +- 0.21% (  29.85%)
Amean   4        173.04  +- 0.43% (        )        138.76  +- 0.03% (  19.81%)        123.59  +- 0.11% (  28.58%)
Amean   8         89.65  +- 0.20% (        )         73.54  +- 0.09% (  17.97%)         65.69  +- 0.10% (  26.72%)
Amean   16        48.08  +- 1.41% (        )         41.64  +- 1.61% (  13.40%)         36.00  +- 1.80% (  25.11%)
Amean   32        28.78  +- 0.72% (        )         26.61  +- 1.99% (   7.55%)         23.19  +- 1.68% (  19.43%)
Amean   64        20.46  +- 1.85% (        )         19.76  +- 0.35% (   3.42%)         17.38  +- 0.92% (  15.06%)
Amean   128       18.69  +- 1.70% (        )         17.59  +- 1.04% (   5.90%)         15.73  +- 1.40% (  15.85%)
Amean   192       18.82  +- 1.01% (        )         17.76  +- 0.77% (   5.67%)         15.57  +- 1.80% (  17.28%)

Benchmark          : gitsource (time to run the git unit test suite)
Varying parameter  : none
Unit               : seconds (lower is better)

                 5.2.0 vanilla (BASELINE)           5.2.0 intel_pstate/HWP                    5.2.0 freq-inv
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean         792.49  +- 0.20% (        )      779.35  +- 0.24% (   1.66%)      427.14  +- 0.16% (   46.10%)

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
 arch/x86/kernel/smpboot.c | 79 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 66 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 814d7900779d..11d57d741584 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1841,23 +1841,71 @@ static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
 	{}
 };
 
-static void core_set_cpu_max_freq(void)
+static int get_turbo_ratio_group(u64 *turbo_ratio)
+{
+	u64 ratio, core_counts;
+	u32 group_size = 0;
+	int err, i, found = 0;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratio);
+	if (err)
+		return err;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &core_counts);
+	if (err)
+		return err;
+
+	for (i = 0; i < 64; i += 8) {
+		group_size = (core_counts >> i) & 0xFF;
+		if (group_size >= 4) {
+			*turbo_ratio = (ratio >> i) & 0xFF;
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found)
+		return 1;
+
+	return 0;
+}
+
+static bool skx_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio)
 {
-	u64 ratio, turbo_ratio;
 	int err;
 
-	err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio);
+	if (!x86_match_cpu(has_skx_turbo_ratio_limits))
+		return false;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, ratio);
 	if (err)
-		return;
+		return false;
 
-	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio);
+	err = get_turbo_ratio_group(turbo_ratio);  /* 4C (circa) turbo ratio */
 	if (err)
-		return;
+		return false;
 
-	ratio = (ratio >> 8) & 0xFF;                /* max P state ratio */
-	turbo_ratio = (turbo_ratio >> 24) & 0xFF;   /* 4C turbo ratio */
+	*ratio = (*ratio >> 8) & 0xFF;                /* max P state ratio */
 
-	arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio);
+	return true;
+}
+
+static bool core_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio)
+{
+	int err;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, ratio);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, turbo_ratio);
+	if (err)
+		return false;
+
+	*ratio = (*ratio >> 8) & 0xFF;                /* max P state ratio */
+	*turbo_ratio = (*turbo_ratio >> 24) & 0xFF;   /* 4C turbo ratio */
+
+	return true;
 }
 
 static void intel_set_cpu_max_freq(void)
@@ -1865,7 +1913,6 @@ static void intel_set_cpu_max_freq(void)
 	/*
 	 * TODO: add support for:
 	 *
-	 * - Xeon Gold/Platinum
 	 * - Xeon Phi (KNM, KNL)
 	 * - Atom Goldmont
 	 * - Atom Silvermont
@@ -1873,15 +1920,21 @@ static void intel_set_cpu_max_freq(void)
 	 * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE
 	 */
 
-	static_branch_enable(&arch_scale_freq_key);
+	u64 ratio = 1, turbo_ratio = 1;
 
 	if (turbo_disabled() ||
-		x86_match_cpu(has_skx_turbo_ratio_limits) ||
 		x86_match_cpu(has_knl_turbo_ratio_limits) ||
 		x86_match_cpu(has_glm_turbo_ratio_limits))
 		return;
 
-	core_set_cpu_max_freq();
+	if (skx_set_cpu_max_freq(&ratio, &turbo_ratio))
+		goto set_value;
+
+	core_set_cpu_max_freq(&ratio, &turbo_ratio);
+
+set_value:
+	arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio);
+	static_branch_enable(&arch_scale_freq_key);
 }
 
 static void init_scale_freq(void *arg)
-- 
2.16.4


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 3/6] x86,sched: Add support for frequency invariance on XEON_PHI_KNL/KNM
  2019-11-13 12:46 [PATCH v4 0/6] Add support for frequency invariance for (some) x86 Giovanni Gherdovich
  2019-11-13 12:46 ` [PATCH v4 1/6] x86,sched: Add support for frequency invariance Giovanni Gherdovich
  2019-11-13 12:46 ` [PATCH v4 2/6] x86,sched: Add support for frequency invariance on SKYLAKE_X Giovanni Gherdovich
@ 2019-11-13 12:46 ` Giovanni Gherdovich
  2019-11-13 12:46 ` [PATCH v4 4/6] x86,sched: Add support for frequency invariance on ATOM_GOLDMONT* Giovanni Gherdovich
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Giovanni Gherdovich @ 2019-11-13 12:46 UTC (permalink / raw)
  To: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann, Doug Smythies,
	Giovanni Gherdovich

The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On Xeon Phi CPUs set freq_max to the second-highest frequency
reported by the CPU.

Xeon Phi CPUs such as Knights Landing and Knights Mill typically have either
one or two turbo frequencies; in the former case that's 100 MHz above the base
frequency, in the latter case the two levels are 100 MHz and 200 MHz above
base frequency.

We set freq_max to the second-highest frequency reported by the CPU. This
could be the base frequency (if only one turbo level is available) or the first
turbo level (if two levels are available). The rationale is to compromise
between power efficiency or performance -- going straight to max turbo would
favor efficiency and blindly using base freq would favor performance.

For reference, this is how MSR_TURBO_RATIO_LIMIT must be parsed on a Xeon Phi
to get the available frequencies (taken from a comment in turbostat's sources):

    [0] -- Reserved
    [7:1] -- Base value of number of active cores of bucket 1.
    [15:8] -- Base value of freq ratio of bucket 1.
    [20:16] -- +ve delta of number of active cores of bucket 2.
    i.e. active cores of bucket 2 =
    active cores of bucket 1 + delta
    [23:21] -- Negative delta of freq ratio of bucket 2.
    i.e. freq ratio of bucket 2 =
    freq ratio of bucket 1 - delta
    [28:24]-- +ve delta of number of active cores of bucket 3.
    [31:29]-- -ve delta of freq ratio of bucket 3.
    [36:32]-- +ve delta of number of active cores of bucket 4.
    [39:37]-- -ve delta of freq ratio of bucket 4.
    [44:40]-- +ve delta of number of active cores of bucket 5.
    [47:45]-- -ve delta of freq ratio of bucket 5.
    [52:48]-- +ve delta of number of active cores of bucket 6.
    [55:53]-- -ve delta of freq ratio of bucket 6.
    [60:56]-- +ve delta of number of active cores of bucket 7.
    [63:61]-- -ve delta of freq ratio of bucket 7.

1. PERFORMANCE EVALUATION: TBENCH +5%
2. NEUTRAL BENCHMARKS (ALL OTHERS)
3. TEST SETUP

1. PERFORMANCE EVALUATION: TBENCH +5%
-------------------------------------

A performance evaluation was conducted on a Knights Mill machine (see "Test
Setup" below), were the frequency-invariance patch (on schedutil) is compared
to both non-invariant schedutil and active intel_pstate with powersave: all
three tested kernels behave the same performance-wise and with regard to power
consumption (performance per watt). The only notable difference is tbench:

comparison ratio of performance with baseline; 1.00 means neutral,
higher is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    tbench                1.04          1.05

performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    tbench                1.03          1.04

which essentially means that frequency-invariant schedutil is 5% better than
baseline, the same as intel_pstate+powersave.

As the results above are averaged over the varying parameter, here the detailed
table.

Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                    5.2.0 vanilla (BASELINE)                 5.2.0 intel_pstate                     5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean   1         49.06  +- 2.12% (        )         51.66  +- 1.52% (   5.30%)         52.87  +- 0.88% (   7.76%)
Hmean   2         93.82  +- 0.45% (        )        103.24  +- 0.70% (  10.05%)        105.90  +- 0.70% (  12.88%)
Hmean   4        192.46  +- 1.15% (        )        215.95  +- 0.60% (  12.21%)        215.78  +- 1.43% (  12.12%)
Hmean   8        406.74  +- 2.58% (        )        438.58  +- 0.36% (   7.83%)        437.61  +- 0.97% (   7.59%)
Hmean   16       857.70  +- 1.22% (        )        890.26  +- 0.72% (   3.80%)        889.11  +- 0.73% (   3.66%)
Hmean   32      1760.10  +- 0.92% (        )       1791.70  +- 0.44% (   1.79%)       1787.95  +- 0.44% (   1.58%)
Hmean   64      3183.50  +- 0.34% (        )       3183.19  +- 0.36% (  -0.01%)       3187.53  +- 0.36% (   0.13%)
Hmean   128     4830.96  +- 0.31% (        )       4846.53  +- 0.30% (   0.32%)       4855.86  +- 0.30% (   0.52%)
Hmean   256     5467.98  +- 0.38% (        )       5793.80  +- 0.28% (   5.96%)       5821.94  +- 0.17% (   6.47%)
Hmean   512     5398.10  +- 0.06% (        )       5745.56  +- 0.08% (   6.44%)       5503.68  +- 0.07% (   1.96%)
Hmean   1024    5290.43  +- 0.63% (        )       5221.07  +- 0.47% (  -1.31%)       5277.22  +- 0.80% (  -0.25%)
Hmean   1088    5139.71  +- 0.57% (        )       5236.02  +- 0.71% (   1.87%)       5190.57  +- 0.41% (   0.99%)

2. NEUTRAL BENCHMARKS (ALL OTHERS)
----------------------------------

* pgbench (both read/write and read-only)
* NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
* hackbench
* netperf
* dbench
* kernbench
* gitsource (git unit test suite)

3. TEST SETUP
-------------

Test machine:

CPU Model   : Intel Xeon Phi CPU 7255 @ 1.10GHz (a.k.a. Knights Mill)
Fam/Mod/Ste : 6:133:0
Topology    : 1 socket, 68 cores / 272 threads
Memory      : 96G
Storage     : rotary, XFS filesystem

Max EFFICiency, BASE frequency and available turbo levels (MHz):

    EFFIC   1000 |**********
    BASE    1100 |***********
    68C     1100 |***********
    30C     1200 |************

Tested kernels:

Baseline      : v5.2,              intel_pstate passive,  schedutil
Comparison #1 : v5.2,              intel_pstate active ,  powersave
Comparison #2 : v5.2, this patch,  intel_pstate passive,  schedutil

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
 arch/x86/kernel/smpboot.c | 54 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 52 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 11d57d741584..0e79dcc03ae4 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1841,6 +1841,55 @@ static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
 	{}
 };
 
+static int get_knl_turbo_ratio(u64 *turbo_ratio)
+{
+	u64 msr;
+	u32 ratio, delta_ratio;
+	int err, i, found = 0;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
+	if (err)
+		return err;
+
+	ratio = (msr >> 8) & 0xFF;
+
+	for (i = 16; i < 64; i += 8) {
+		delta_ratio = (msr >> (i + 5)) & 0x7;
+		if (delta_ratio) {
+			*turbo_ratio = ratio - delta_ratio;
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found)
+		return 1;
+
+	return 0;
+}
+
+static bool knl_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio)
+{
+	int err;
+
+	if (!x86_match_cpu(has_knl_turbo_ratio_limits))
+		return false;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, ratio);
+	if (err)
+		return false;
+
+	/* second highest turbo ratio */
+	err = get_knl_turbo_ratio(turbo_ratio);
+	if (err)
+		return false;
+
+	/* max P state ratio */
+	*ratio = (*ratio >> 8) & 0xFF;
+
+	return true;
+}
+
 static int get_turbo_ratio_group(u64 *turbo_ratio)
 {
 	u64 ratio, core_counts;
@@ -1913,7 +1962,6 @@ static void intel_set_cpu_max_freq(void)
 	/*
 	 * TODO: add support for:
 	 *
-	 * - Xeon Phi (KNM, KNL)
 	 * - Atom Goldmont
 	 * - Atom Silvermont
 	 *
@@ -1923,10 +1971,12 @@ static void intel_set_cpu_max_freq(void)
 	u64 ratio = 1, turbo_ratio = 1;
 
 	if (turbo_disabled() ||
-		x86_match_cpu(has_knl_turbo_ratio_limits) ||
 		x86_match_cpu(has_glm_turbo_ratio_limits))
 		return;
 
+	if (knl_set_cpu_max_freq(&ratio, &turbo_ratio))
+		goto set_value;
+
 	if (skx_set_cpu_max_freq(&ratio, &turbo_ratio))
 		goto set_value;
 
-- 
2.16.4


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 4/6] x86,sched: Add support for frequency invariance on ATOM_GOLDMONT*
  2019-11-13 12:46 [PATCH v4 0/6] Add support for frequency invariance for (some) x86 Giovanni Gherdovich
                   ` (2 preceding siblings ...)
  2019-11-13 12:46 ` [PATCH v4 3/6] x86,sched: Add support for frequency invariance on XEON_PHI_KNL/KNM Giovanni Gherdovich
@ 2019-11-13 12:46 ` Giovanni Gherdovich
  2019-11-13 12:46 ` [PATCH v4 5/6] x86,sched: Add support for frequency invariance on ATOM Giovanni Gherdovich
  2019-11-13 12:46 ` [PATCH v4 6/6] x86: intel_pstate: handle runtime turbo disablement/enablement in freq. invariance Giovanni Gherdovich
  5 siblings, 0 replies; 19+ messages in thread
From: Giovanni Gherdovich @ 2019-11-13 12:46 UTC (permalink / raw)
  To: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann, Doug Smythies,
	Giovanni Gherdovich

The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On GOLDMONT (aka Apollo Lake), GOLDMONT_D (aka Denverton) and
GOLDMONT_PLUS CPUs (aka Gemini Lake) set freq_max to the highest frequency
reported by the CPU.

The encoding of turbo ratios for GOLDMONT* is identical to the one for
SKYLAKE_X, but we treat the Atom case apart because we want to set freq_max to
a higher value, thus the ratio freq_curr/freq_max to be lower, leading to more
conservative frequency selections (favoring power efficiency).

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
 arch/x86/kernel/smpboot.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 0e79dcc03ae4..4d192abf337d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1795,6 +1795,10 @@ void native_play_dead(void)
  * which would ignore the entire turbo range (a conspicuous part, making
  * freq_curr/freq_max always maxed out).
  *
+ * An exception to the heuristic above is the Atom uarch, where we choose the
+ * highest turbo level for freq_max since Atom's are generally oriented towards
+ * power efficiency.
+ *
  * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
  * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
  */
@@ -1841,6 +1845,27 @@ static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
 	{}
 };
 
+static bool glm_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio)
+{
+	int err;
+
+	if (!x86_match_cpu(has_glm_turbo_ratio_limits))
+		return false;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, ratio);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, turbo_ratio);
+	if (err)
+		return false;
+
+	*ratio = (*ratio >> 8) & 0xFF;        /* max P state ratio */
+	*turbo_ratio = *turbo_ratio & 0xFF;   /* highest turbo ratio */
+
+	return true;
+}
+
 static int get_knl_turbo_ratio(u64 *turbo_ratio)
 {
 	u64 msr;
@@ -1962,7 +1987,6 @@ static void intel_set_cpu_max_freq(void)
 	/*
 	 * TODO: add support for:
 	 *
-	 * - Atom Goldmont
 	 * - Atom Silvermont
 	 *
 	 * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE
@@ -1970,10 +1994,12 @@ static void intel_set_cpu_max_freq(void)
 
 	u64 ratio = 1, turbo_ratio = 1;
 
-	if (turbo_disabled() ||
-		x86_match_cpu(has_glm_turbo_ratio_limits))
+	if (turbo_disabled())
 		return;
 
+	if (glm_set_cpu_max_freq(&ratio, &turbo_ratio))
+		goto set_value;
+
 	if (knl_set_cpu_max_freq(&ratio, &turbo_ratio))
 		goto set_value;
 
-- 
2.16.4


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 5/6] x86,sched: Add support for frequency invariance on ATOM
  2019-11-13 12:46 [PATCH v4 0/6] Add support for frequency invariance for (some) x86 Giovanni Gherdovich
                   ` (3 preceding siblings ...)
  2019-11-13 12:46 ` [PATCH v4 4/6] x86,sched: Add support for frequency invariance on ATOM_GOLDMONT* Giovanni Gherdovich
@ 2019-11-13 12:46 ` Giovanni Gherdovich
  2019-11-13 16:50   ` Srinivas Pandruvada
  2019-11-13 12:46 ` [PATCH v4 6/6] x86: intel_pstate: handle runtime turbo disablement/enablement in freq. invariance Giovanni Gherdovich
  5 siblings, 1 reply; 19+ messages in thread
From: Giovanni Gherdovich @ 2019-11-13 12:46 UTC (permalink / raw)
  To: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann, Doug Smythies,
	Giovanni Gherdovich

The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On all ATOM CPUs prior to Goldmont, set freq_max to the 1-core
turbo ratio.

We intended to perform tests validating that this patch doesn't regress in
terms of energy efficiency, given that this is the primary concern on Atom
processors. Alas, we found out that turbostat doesn't support reading RAPL
interfaces on our test machine (Airmont), and we don't have external equipment
to measure power consumption; all we have is the performance results of the
benchmarks we ran.

Test machine:

Platform    : Dell Wyse 3040 Thin Client[1]
CPU Model   : Intel Atom x5-Z8350 (aka Cherry Trail, aka Airmont)
Fam/Mod/Ste : 6:76:4
Topology    : 1 socket, 4 cores / 4 threads
Memory      : 2G
Storage     : onboard flash, XFS filesystem

[1] https://www.dell.com/en-us/work/shop/wyse-endpoints-and-software/wyse-3040-thin-client/spd/wyse-3040-thin-client

Base frequency and available turbo levels (MHz):

    Min Operating Freq   266 |***
    Low Freq Mode        800 |********
    Base Freq           2400 |************************
    4 Cores             2800 |****************************
    3 Cores             2800 |****************************
    2 Cores             3200 |********************************
    1 Core              3200 |********************************

Tested kernels:

Baseline      : v5.4-rc1,              intel_pstate passive,  schedutil
Comparison #1 : v5.4-rc1,              intel_pstate active ,  powersave
Comparison #2 : v5.4-rc1, this patch,  intel_pstate passive,  schedutil

tbench, hackbench and kernbench performed the same under all three kernels;
dbench ran faster with intel_pstate/powersave and the git unit tests were a
lot faster with intel_pstate/powersave and invariant schedutil wrt the
baseline. Not that any of this is terrbily interesting anyway, one doesn't buy
an Atom system to go fast. Power consumption regressions aren't expected but
we lack the equipment to make that measurement. Turbostat seems to think that
reading RAPL on this machine isn't a good idea and we're trusting that
decision.

comparison ratio of performance with baseline; 1.00 means neutral,
lower is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench                0.90             ~
    kernbench             0.98          0.97
    gitsource             0.63          0.43

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
 arch/x86/kernel/smpboot.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 4d192abf337d..8988177064be 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1821,6 +1821,24 @@ static bool turbo_disabled(void)
 	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
 }
 
+static bool slv_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio)
+{
+	int err;
+
+	err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, ratio);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_ratio);
+	if (err)
+		return false;
+
+	*ratio = (*ratio >> 16) & 0x3F;      /* max P state ratio */
+	*turbo_ratio = *turbo_ratio & 0x3F;  /* 1C turbo ratio */
+
+	return true;
+}
+
 #include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 
@@ -1984,19 +2002,14 @@ static bool core_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio)
 
 static void intel_set_cpu_max_freq(void)
 {
-	/*
-	 * TODO: add support for:
-	 *
-	 * - Atom Silvermont
-	 *
-	 * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE
-	 */
-
 	u64 ratio = 1, turbo_ratio = 1;
 
 	if (turbo_disabled())
 		return;
 
+	if (slv_set_cpu_max_freq(&ratio, &turbo_ratio))
+		goto set_value;
+
 	if (glm_set_cpu_max_freq(&ratio, &turbo_ratio))
 		goto set_value;
 
-- 
2.16.4


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 6/6] x86: intel_pstate: handle runtime turbo disablement/enablement in freq. invariance
  2019-11-13 12:46 [PATCH v4 0/6] Add support for frequency invariance for (some) x86 Giovanni Gherdovich
                   ` (4 preceding siblings ...)
  2019-11-13 12:46 ` [PATCH v4 5/6] x86,sched: Add support for frequency invariance on ATOM Giovanni Gherdovich
@ 2019-11-13 12:46 ` Giovanni Gherdovich
  5 siblings, 0 replies; 19+ messages in thread
From: Giovanni Gherdovich @ 2019-11-13 12:46 UTC (permalink / raw)
  To: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann, Doug Smythies,
	Giovanni Gherdovich

On some platforms such as the Dell XPS 13 laptop the firmware disables turbo
when the machine is disconnected from AC, and viceversa it enables it again
when it's reconnected. In these cases a _PPC ACPI notification is issued.

The scheduler needs to know freq_max for frequency-invariant calculations.
To account for turbo availability to come and go, record freq_max at boot as
if turbo was available and store it in a helper variable. Use a setter
function to swap between 1024 and freq_max every time turbo goes off or on.

Reminder: "freq_max" isn't really the maximum frequency but the ratio
turbo_freq*1024/base_freq, which becomes 1024 when turbo is disabled. The name
could be misleading but makes explainations a lot simpler and is really
convenient in calculations.

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
 arch/x86/include/asm/topology.h |  5 +++++
 arch/x86/kernel/smpboot.c       | 14 ++++++++++----
 drivers/cpufreq/intel_pstate.c  |  1 +
 3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9b3aca463c8f..4834bde53b5a 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -214,6 +214,11 @@ static inline long arch_scale_freq_capacity(int cpu)
 extern void arch_scale_freq_tick(void);
 #define arch_scale_freq_tick arch_scale_freq_tick
 
+extern void set_arch_max_freq(bool turbo_disabled);
+#else
+static inline void set_arch_max_freq(bool turbo_disabled)
+{
+}
 #endif
 
 #endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8988177064be..f94aa1dfc778 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1807,8 +1807,15 @@ DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
 
 static DEFINE_PER_CPU(u64, arch_prev_aperf);
 static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static u64 arch_max_turbo_freq = SCHED_CAPACITY_SCALE;
 static u64 arch_max_freq = SCHED_CAPACITY_SCALE;
 
+void set_arch_max_freq(bool turbo_disabled)
+{
+	arch_max_freq = turbo_disabled ? SCHED_CAPACITY_SCALE :
+					arch_max_turbo_freq;
+}
+
 static bool turbo_disabled(void)
 {
 	u64 misc_en;
@@ -2004,9 +2011,6 @@ static void intel_set_cpu_max_freq(void)
 {
 	u64 ratio = 1, turbo_ratio = 1;
 
-	if (turbo_disabled())
-		return;
-
 	if (slv_set_cpu_max_freq(&ratio, &turbo_ratio))
 		goto set_value;
 
@@ -2022,7 +2026,9 @@ static void intel_set_cpu_max_freq(void)
 	core_set_cpu_max_freq(&ratio, &turbo_ratio);
 
 set_value:
-	arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio);
+	arch_max_turbo_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE,
+					ratio);
+	set_arch_max_freq(turbo_disabled());
 	static_branch_enable(&arch_scale_freq_key);
 }
 
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 8ab31702cf6a..6bf50783bc24 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -922,6 +922,7 @@ static void intel_pstate_update_limits(unsigned int cpu)
 	 */
 	if (global.turbo_disabled_mf != global.turbo_disabled) {
 		global.turbo_disabled_mf = global.turbo_disabled;
+		set_arch_max_freq(global.turbo_disabled);
 		for_each_possible_cpu(cpu)
 			intel_pstate_update_max_freq(cpu);
 	} else {
-- 
2.16.4


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 5/6] x86,sched: Add support for frequency invariance on ATOM
  2019-11-13 12:46 ` [PATCH v4 5/6] x86,sched: Add support for frequency invariance on ATOM Giovanni Gherdovich
@ 2019-11-13 16:50   ` Srinivas Pandruvada
  2019-11-15 10:34     ` Giovanni Gherdovich
  0 siblings, 1 reply; 19+ messages in thread
From: Srinivas Pandruvada @ 2019-11-13 16:50 UTC (permalink / raw)
  To: Giovanni Gherdovich, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann, Doug Smythies

On Wed, 2019-11-13 at 13:46 +0100, Giovanni Gherdovich wrote:
> The scheduler needs the ratio freq_curr/freq_max for frequency-
> invariant
> accounting. On all ATOM CPUs prior to Goldmont, set freq_max to the
> 1-core
> turbo ratio.
> 
> We intended to perform tests validating that this patch doesn't
> regress in
> terms of energy efficiency, given that this is the primary concern on
> Atom
> processors. Alas, we found out that turbostat doesn't support reading
> RAPL
> interfaces on our test machine (Airmont), and we don't have external
> equipment
> to measure power consumption; all we have is the performance results
> of the
> benchmarks we ran.
> 
I can run some benchmarks on this.

Thanks,
Srinivas

> Test machine:
> 
> Platform    : Dell Wyse 3040 Thin Client[1]
> CPU Model   : Intel Atom x5-Z8350 (aka Cherry Trail, aka Airmont)
> Fam/Mod/Ste : 6:76:4
> Topology    : 1 socket, 4 cores / 4 threads
> Memory      : 2G
> Storage     : onboard flash, XFS filesystem
> 
> [1] 
> https://www.dell.com/en-us/work/shop/wyse-endpoints-and-software/wyse-3040-thin-client/spd/wyse-3040-thin-client
> 
> Base frequency and available turbo levels (MHz):
> 
>     Min Operating Freq   266 |***
>     Low Freq Mode        800 |********
>     Base Freq           2400 |************************
>     4 Cores             2800 |****************************
>     3 Cores             2800 |****************************
>     2 Cores             3200 |********************************
>     1 Core              3200 |********************************
> 
> Tested kernels:
> 
> Baseline      : v5.4-rc1,              intel_pstate
> passive,  schedutil
> Comparison #1 : v5.4-rc1,              intel_pstate active
> ,  powersave
> Comparison #2 : v5.4-rc1, this patch,  intel_pstate
> passive,  schedutil
> 
> tbench, hackbench and kernbench performed the same under all three
> kernels;
> dbench ran faster with intel_pstate/powersave and the git unit tests
> were a
> lot faster with intel_pstate/powersave and invariant schedutil wrt
> the
> baseline. Not that any of this is terrbily interesting anyway, one
> doesn't buy
> an Atom system to go fast. Power consumption regressions aren't
> expected but
> we lack the equipment to make that measurement. Turbostat seems to
> think that
> reading RAPL on this machine isn't a good idea and we're trusting
> that
> decision.
> 
> comparison ratio of performance with baseline; 1.00 means neutral,
> lower is better:
> 
>                       I_PSTATE      FREQ-INV
>     ----------------------------------------
>     dbench                0.90             ~
>     kernbench             0.98          0.97
>     gitsource             0.63          0.43
> 
> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
> ---
>  arch/x86/kernel/smpboot.c | 29 +++++++++++++++++++++--------
>  1 file changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 4d192abf337d..8988177064be 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1821,6 +1821,24 @@ static bool turbo_disabled(void)
>  	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
>  }
>  
> +static bool slv_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio)
> +{
> +	int err;
> +
> +	err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, ratio);
> +	if (err)
> +		return false;
> +
> +	err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_ratio);
> +	if (err)
> +		return false;
> +
> +	*ratio = (*ratio >> 16) & 0x3F;      /* max P state ratio */
> +	*turbo_ratio = *turbo_ratio & 0x3F;  /* 1C turbo ratio */
> +
> +	return true;
> +}
> +
>  #include <asm/cpu_device_id.h>
>  #include <asm/intel-family.h>
>  
> @@ -1984,19 +2002,14 @@ static bool core_set_cpu_max_freq(u64 *ratio,
> u64 *turbo_ratio)
>  
>  static void intel_set_cpu_max_freq(void)
>  {
> -	/*
> -	 * TODO: add support for:
> -	 *
> -	 * - Atom Silvermont
> -	 *
> -	 * which all now get by default arch_max_freq =
> SCHED_CAPACITY_SCALE
> -	 */
> -
>  	u64 ratio = 1, turbo_ratio = 1;
>  
>  	if (turbo_disabled())
>  		return;
>  
> +	if (slv_set_cpu_max_freq(&ratio, &turbo_ratio))
> +		goto set_value;
> +
>  	if (glm_set_cpu_max_freq(&ratio, &turbo_ratio))
>  		goto set_value;
>  


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 5/6] x86,sched: Add support for frequency invariance on ATOM
  2019-11-13 16:50   ` Srinivas Pandruvada
@ 2019-11-15 10:34     ` Giovanni Gherdovich
  0 siblings, 0 replies; 19+ messages in thread
From: Giovanni Gherdovich @ 2019-11-15 10:34 UTC (permalink / raw)
  To: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann, Doug Smythies

On Wed, 2019-11-13 at 08:50 -0800, Srinivas Pandruvada wrote:
> On Wed, 2019-11-13 at 13:46 +0100, Giovanni Gherdovich wrote:
> > The scheduler needs the ratio freq_curr/freq_max for frequency-
> > invariant
> > accounting. On all ATOM CPUs prior to Goldmont, set freq_max to the
> > 1-core
> > turbo ratio.
> > 
> > We intended to perform tests validating that this patch doesn't
> > regress in
> > terms of energy efficiency, given that this is the primary concern on
> > Atom
> > processors. Alas, we found out that turbostat doesn't support reading
> > RAPL
> > interfaces on our test machine (Airmont), and we don't have external
> > equipment
> > to measure power consumption; all we have is the performance results
> > of the
> > benchmarks we ran.
> > 
> 
> I can run some benchmarks on this.
> 

That'd be appreciated, thank you Srinivas.


Giovanni

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v4 1/6] x86,sched: Add support for frequency invariance
  2019-11-13 12:46 ` [PATCH v4 1/6] x86,sched: Add support for frequency invariance Giovanni Gherdovich
@ 2019-11-24  7:49   ` Doug Smythies
  2019-11-25  8:16     ` Doug Smythies
                       ` (2 more replies)
  2019-12-02 16:34   ` Ionela Voinescu
  1 sibling, 3 replies; 19+ messages in thread
From: Doug Smythies @ 2019-11-24  7:49 UTC (permalink / raw)
  To: Giovanni Gherdovich, Srinivas Pandruvada, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown,
	Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann

Hi all,

The address list here is likely incorrect,
and this e-mail is really about a kernel 5.4
bisected regression.

It had been since mid September, and kernel 5.3-rc8 since
I had tried this, so I wanted to try it again. Call it due diligence.
I focused on my own version of the "gitsource" test.

Kernel 5.4-rc8 (as a baseline reference).

My results were extremely surprising.

As it turns out, at least on my test computer, both the
acpi-cpufreq and intel_cpufreq CPU frequency scaling drivers
using the schedutil governor are broken. For the tests that
I ran, there is negligible difference between them and the
performance governor. So, one might argue that they are not
broken, but rather working incredibly well, which if true
then this patch is no longer needed.

I bisected the kernel and got:

first bad commit: [04cbfba6208592999d7bfe6609ec01dc3fde73f5]
Merge tag 'dmaengine-5.4-rc1' of git://git.infradead.org/users/vkoul/slave-dma

Which did not make any sense at all. I don't even know how
this is being pulled into my kernel compile.
O.K., I often (usually) make a mistake
during bisection, so I did it again, and got the same result.

Relevant excerpt from the commit:

diff --cc drivers/dma/Kconfig
index 413efef,03fa0c5..7c511e3
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@@ -294,8 -294,8 +294,8 @@@ config INTEL_IOATDM
          If unsure, say N.

  config INTEL_IOP_ADMA
 -      tristate "Intel IOP ADMA support"
 -      depends on ARCH_IOP32X || ARCH_IOP33X || ARCH_IOP13XX
 +      tristate "Intel IOP32x ADMA support"
-       depends on ARCH_IOP32X
++      depends on ARCH_IOP32X || COMPILE_TEST
        select DMA_ENGINE
        select ASYNC_TX_ENABLE_CHANNEL_SWITCH
        help

If I revert the above, manually, then everything behaves
as expected (minimally tested only, so far).

Are others seeing the schedutil governors not working as
expected with any of kernels 5.4-rc1 - 5.4-rc8?

I do have a pretty graph of my method of doing the
"gitsource" test, but am not ready to post it yet.
Here is some gitsource test data, 6 runs of "make test",
the first run is discarded:

"gg 6" means this 6 patch set.

Kernel 5.4-rc8 + revert, intel_cpufreq/schedutil: 3899 seconds
Kernel 5.4-rc8 + gg 6 + revert, intel_cpufreq/schedutil: 2740.7 seconds
Ratio: 0.70 (as expected)
Kernel 5.4-rc8, intel_cpufreq/schedutil: 2334.7 seconds (faster than expected)
Kernel 5.4-rc8 + gg 6 patch set, intel_cpufreq/schedutil: 2275.0 seconds (faster than expected)
Ratio: 0.97 (not as expected)
Kernel 5.4-rc8, intel_cpufreq/performance: 2215.3 seconds
Kernel 5.4-rc8, intel_cpufreq/ondemand: 3286.3 seconds
Re-stated from previous e-mail:
Kernel 5.3-rc8, intel_cpufreq/schedutil: ratio: 0.69 (I don't have the original times)

... Doug



^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v4 1/6] x86,sched: Add support for frequency invariance
  2019-11-24  7:49   ` Doug Smythies
@ 2019-11-25  8:16     ` Doug Smythies
  2019-11-25  9:16     ` Mel Gorman
  2019-11-25 16:06     ` Giovanni Gherdovich
  2 siblings, 0 replies; 19+ messages in thread
From: Doug Smythies @ 2019-11-25  8:16 UTC (permalink / raw)
  To: Giovanni Gherdovich, Srinivas Pandruvada, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown,
	Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann

On 2019.11.23 23:50 Doug Smythies wrote:

> Hi all,
>
> The address list here is likely incorrect,
> and this e-mail is really about a kernel 5.4
> bisected regression.
>
> It had been since mid September, and kernel 5.3-rc8 since
> I had tried this, so I wanted to try it again. Call it due diligence.
> I focused on my own version of the "gitsource" test.
>
> Kernel 5.4-rc8 (as a baseline reference).
>
> My results were extremely surprising.
>
> As it turns out, at least on my test computer, both the
> acpi-cpufreq and intel_cpufreq CPU frequency scaling drivers
> using the schedutil governor are broken. For the tests that
> I ran, there is negligible difference between them and the
> performance governor. So, one might argue that they are not
> broken, but rather working incredibly well, which if true
> then this patch is no longer needed.

Should be able to gain better insight here with the 
intel_pstate_tracer.py utility, watching for differences
in rates of rotation between CPUs. Too late tonight.

>
> I bisected the kernel and got:
>
> first bad commit: [04cbfba6208592999d7bfe6609ec01dc3fde73f5]
> Merge tag 'dmaengine-5.4-rc1' of git://git.infradead.org/users/vkoul/slave-dma
>
> Which did not make any sense at all. I don't even know how
> this is being pulled into my kernel compile.
> O.K., I often (usually) make a mistake
> during bisection, so I did it again, and got the same result.
>
> Relevant excerpt from the commit:
>
> diff --cc drivers/dma/Kconfig
> index 413efef,03fa0c5..7c511e3
> --- a/drivers/dma/Kconfig
> +++ b/drivers/dma/Kconfig
> @@@ -294,8 -294,8 +294,8 @@@ config INTEL_IOATDM
>          If unsure, say N.
>
>  config INTEL_IOP_ADMA
>  -      tristate "Intel IOP ADMA support"
>  -      depends on ARCH_IOP32X || ARCH_IOP33X || ARCH_IOP13XX
>  +      tristate "Intel IOP32x ADMA support"
> -       depends on ARCH_IOP32X
> ++      depends on ARCH_IOP32X || COMPILE_TEST
>         select DMA_ENGINE
>         select ASYNC_TX_ENABLE_CHANNEL_SWITCH
>         help
>
> If I revert the above, manually, then everything behaves
> as expected (minimally tested only, so far).
>
> Are others seeing the schedutil governors not working as
> expected with any of kernels 5.4-rc1 - 5.4-rc8?
>
> I do have a pretty graph of my method of doing the
> "gitsource" test, but am not ready to post it yet.

Graphs and write up (mostly the same as herein) are now
here:

http://www.smythies.com/~doug/linux/single-threaded/k54regression/index.html

The graphs are rather crowded, sorry.

> Here is some gitsource test data, 6 runs of "make test",
> the first run is discarded:
>
> "gg 6" means this 6 patch set.
>
> Kernel 5.4-rc8 + revert, intel_cpufreq/schedutil: 3899 seconds
> Kernel 5.4-rc8 + gg 6 + revert, intel_cpufreq/schedutil: 2740.7 seconds
> Ratio: 0.70 (as expected)

Kernel 5.4-rc8 + gg 6 + revert, forced CPU affinity performance: 2106.5 seconds

> Kernel 5.4-rc8, intel_cpufreq/schedutil: 2334.7 seconds (faster than expected)
> Kernel 5.4-rc8 + gg 6 patch set, intel_cpufreq/schedutil: 2275.0 seconds (faster than expected)
> Ratio: 0.97 (not as expected)
> Kernel 5.4-rc8, intel_cpufreq/performance: 2215.3 seconds
> Kernel 5.4-rc8, intel_cpufreq/ondemand: 3286.3 seconds
> Re-stated from previous e-mail:
> Kernel 5.3-rc8, intel_cpufreq/schedutil: ratio: 0.69 (I don't have the original times)

... Doug



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/6] x86,sched: Add support for frequency invariance
  2019-11-24  7:49   ` Doug Smythies
  2019-11-25  8:16     ` Doug Smythies
@ 2019-11-25  9:16     ` Mel Gorman
  2019-11-25 16:06     ` Giovanni Gherdovich
  2 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2019-11-25  9:16 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Giovanni Gherdovich, Srinivas Pandruvada, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown,
	Rafael J . Wysocki, x86, linux-pm, linux-kernel, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann

On Sat, Nov 23, 2019 at 11:49:57PM -0800, Doug Smythies wrote:
> Hi all,
> 
> The address list here is likely incorrect,
> and this e-mail is really about a kernel 5.4
> bisected regression.
> 
> It had been since mid September, and kernel 5.3-rc8 since
> I had tried this, so I wanted to try it again. Call it due diligence.
> I focused on my own version of the "gitsource" test.
> 
> Kernel 5.4-rc8 (as a baseline reference).
> 
> My results were extremely surprising.
> 
> As it turns out, at least on my test computer, both the
> acpi-cpufreq and intel_cpufreq CPU frequency scaling drivers
> using the schedutil governor are broken. For the tests that
> I ran, there is negligible difference between them and the
> performance governor. So, one might argue that they are not
> broken, but rather working incredibly well, which if true
> then this patch is no longer needed.
> 

I don't think that is necessarily fair. It still makes sense that the
"size" of a task is independent of the frequency the CPU is running at
for load balancing decisions and should be merged on that basis alone. If
the scheduler made perfect decisions on task location then there would
be negligible difference between ondemand (or powersave if intel_pstate)
and performance governor too. cpufreq tends to be more noticable under
the current implementation as tasks are a lot more mobile than they need
to be for basic workloads.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/6] x86,sched: Add support for frequency invariance
  2019-11-24  7:49   ` Doug Smythies
  2019-11-25  8:16     ` Doug Smythies
  2019-11-25  9:16     ` Mel Gorman
@ 2019-11-25 16:06     ` Giovanni Gherdovich
  2019-11-26  5:59       ` Doug Smythies
  2 siblings, 1 reply; 19+ messages in thread
From: Giovanni Gherdovich @ 2019-11-25 16:06 UTC (permalink / raw)
  To: Doug Smythies, Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann

On Sat, 2019-11-23 at 23:49 -0800, Doug Smythies wrote:
> ...
> Kernel 5.4-rc8 + revert, intel_cpufreq/schedutil: 3899 seconds
> Kernel 5.4-rc8 + gg 6 + revert, intel_cpufreq/schedutil: 2740.7 seconds
> Ratio: 0.70 (as expected)
> Kernel 5.4-rc8, intel_cpufreq/schedutil: 2334.7 seconds (faster than expected)
> Kernel 5.4-rc8 + gg 6 patch set, intel_cpufreq/schedutil: 2275.0 seconds (faster than expected)
> Ratio: 0.97 (not as expected)
> Kernel 5.4-rc8, intel_cpufreq/performance: 2215.3 seconds
> Kernel 5.4-rc8, intel_cpufreq/ondemand: 3286.3 seconds
> Re-stated from previous e-mail:
> Kernel 5.3-rc8, intel_cpufreq/schedutil: ratio: 0.69 (I don't have the original times)

Hello Doug,

schedutil in 5.4 going a lot faster than in 5.3 would be a surprise. I'm
running that same test too to check if I can see it as well.

Besides, as it's already been said this patchset adds frequency
scale-invariance to scheduler metrics such as load and utilization and that's
useful also in areas other than frequency scaling (most notably the scheduler
load balancer).


Thanks,
Giovanni

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v4 1/6] x86,sched: Add support for frequency invariance
  2019-11-25 16:06     ` Giovanni Gherdovich
@ 2019-11-26  5:59       ` Doug Smythies
  2019-11-26 15:20         ` Giovanni Gherdovich
  0 siblings, 1 reply; 19+ messages in thread
From: Doug Smythies @ 2019-11-26  5:59 UTC (permalink / raw)
  To: Giovanni Gherdovich, Srinivas Pandruvada, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown,
	Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann

On 2019.11.25 08:06 Giovanni Gherdovich wrote:
> On Sat, 2019-11-23 at 23:49 -0800, Doug Smythies wrote:
>> ...
>> Kernel 5.4-rc8 + revert, intel_cpufreq/schedutil: 3899 seconds
>> Kernel 5.4-rc8 + gg 6 + revert, intel_cpufreq/schedutil: 2740.7 seconds
>> Ratio: 0.70 (as expected)
>> Kernel 5.4-rc8, intel_cpufreq/schedutil: 2334.7 seconds (faster than expected)
>> Kernel 5.4-rc8 + gg 6 patch set, intel_cpufreq/schedutil: 2275.0 seconds (faster than expected)
>> Ratio: 0.97 (not as expected)
>> Kernel 5.4-rc8, intel_cpufreq/performance: 2215.3 seconds
>> Kernel 5.4-rc8, intel_cpufreq/ondemand: 3286.3 seconds
>> Re-stated from previous e-mail:
>> Kernel 5.3-rc8, intel_cpufreq/schedutil: ratio: 0.69 (I don't have the original times)
>
> Hello Doug,
>
> schedutil in 5.4 going a lot faster than in 5.3 would be a surprise. I'm
> running that same test too to check if I can see it as well.

Great, thanks. But see below.

>
> Besides, as it's already been said this patchset adds frequency
> scale-invariance to scheduler metrics such as load and utilization and that's
> useful also in areas other than frequency scaling (most notably the scheduler
> load balancer).

The issue with the schedutil governor not working properly in the 5.4 RC series
appears to be hardware dependant.

My test computer is Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz., Sandy Bridge.
On a temporary basis, I acquired a computer with an
Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz, Haswell,
and schedutil governor behaviour with the exact same kernels is fine:

That "gitsource" test, "make test" 6 times, first run thrown out:

Kernel 5.4 intel_cpufreq/schedutil: 3411.8 seconds
Kernel 5.4 + gg 6 intel_cpufreq/schedutil: 1696.7 seconds
Ratio: 0.49
Recall you got a ratio of 0.49 with 5th generation, Broadwell.

... Doug



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/6] x86,sched: Add support for frequency invariance
  2019-11-26  5:59       ` Doug Smythies
@ 2019-11-26 15:20         ` Giovanni Gherdovich
  2019-11-27  7:32           ` Doug Smythies
  0 siblings, 1 reply; 19+ messages in thread
From: Giovanni Gherdovich @ 2019-11-26 15:20 UTC (permalink / raw)
  To: Doug Smythies, Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann

On Mon, 2019-11-25 at 21:59 -0800, Doug Smythies wrote:
> [...]
> The issue with the schedutil governor not working properly in the 5.4 RC series
> appears to be hardware dependant.
> 
> My test computer is Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz., Sandy Bridge.
> On a temporary basis, I acquired a computer with an
> Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz, Haswell,
> and schedutil governor behaviour with the exact same kernels is fine:
> 
> That "gitsource" test, "make test" 6 times, first run thrown out:
> 
> Kernel 5.4 intel_cpufreq/schedutil: 3411.8 seconds
> Kernel 5.4 + gg 6 intel_cpufreq/schedutil: 1696.7 seconds
> Ratio: 0.49
> Recall you got a ratio of 0.49 with 5th generation, Broadwell.

It's good to hear that we're getting the same performance numbers for this
patchset on all hardware that is not a Sandy Bridge. Thanks for double
checking, independent verification is always valuable.

Now, regarding the 5.4 regression for schedutil you see on Sandy Bridge: can
we move this to the kernel bugzilla? Would you care to open a bug there and CC
me to it? If it's reproducible we should assess it and see what can be done.

I've tried gitsource on 5.3 versus 5.4, using intel_cpufreq + schedutil; I
don't see the drop you're observing, but I don't have a Sandy Bridge readily
available. This is what I see:

Arithmetic mean of elapsed time for gitsource over 5 iterations (seconds):

    microarch         v5.3 (baseline)                             v5.4
    ------------------------------------------------------------------
    Haswell         1337.84  +- 0.11%     1336.35  +- 0.12% (   0.11%)
    Broadwell       1335.42  +- 0.08%     1352.54  +- 0.03% (  -1.28%)
    Skylake          887.03  +- 1.02%      870.90  +- 1.19% (   1.82%)

I'm looking around for a Sandy Bridge but I can make no promises at the
moment.


Thanks,
Giovanni

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v4 1/6] x86,sched: Add support for frequency invariance
  2019-11-26 15:20         ` Giovanni Gherdovich
@ 2019-11-27  7:32           ` Doug Smythies
  2019-11-28 22:48             ` Doug Smythies
  0 siblings, 1 reply; 19+ messages in thread
From: Doug Smythies @ 2019-11-27  7:32 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Peter Zijlstra,
	Vincent Guittot, Quentin Perret, Dietmar Eggemann,
	Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Len Brown, Rafael J . Wysocki

On 2019.11.26 07:20 Giovanni Gherdovich wrote:
> On Mon, 2019-11-25 at 21:59 -0800, Doug Smythies wrote:
>> [...]
>> The issue with the schedutil governor not working properly in the 5.4 RC series
>> appears to be hardware dependant.
>> 
>> My test computer is Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz., Sandy Bridge.
>> On a temporary basis, I acquired a computer with an
>> Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz, Haswell,
>> and schedutil governor behaviour with the exact same kernels is fine:
>> 
>> That "gitsource" test, "make test" 6 times, first run thrown out:
>> 
>> Kernel 5.4 intel_cpufreq/schedutil: 3411.8 seconds
>> Kernel 5.4 + gg 6 intel_cpufreq/schedutil: 1696.7 seconds
>> Ratio: 0.49
>> Recall you got a ratio of 0.49 with 5th generation, Broadwell.
>
> It's good to hear that we're getting the same performance numbers for this
> patchset on all hardware that is not a Sandy Bridge. Thanks for double
> checking, independent verification is always valuable.
> 
> Now, regarding the 5.4 regression for schedutil you see on Sandy Bridge: can
> we move this to the kernel bugzilla? Would you care to open a bug there and CC
> me to it?

O.K., I'll need another day or two to isolate further, then I'll open a bug.
I now understand considerably more, and why my bisection ended up
at a strange spot.

> If it's reproducible we should assess it and see what can be done.

On my Sandy Bridge system if the kernel configuration contains:

CONFIG_UCLAMP_TASK_GROUP=y

Then the intel_cpufreq/schedutil will respond much like the performance
governor.

If the kernel configuration contains:

# CONFIG_UCLAMP_TASK_GROUP is not set

Then the intel_cpufreq/schedutil will respond much like it used to.

On the Haswell computer, it doesn't seem to matter, and your tests
seem to confirm this.

Note: I steal my kernel configuration from the Ubuntu mainline builds,
and they changed this parameter during the 5.4-rc series.

... Doug



^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v4 1/6] x86,sched: Add support for frequency invariance
  2019-11-27  7:32           ` Doug Smythies
@ 2019-11-28 22:48             ` Doug Smythies
  0 siblings, 0 replies; 19+ messages in thread
From: Doug Smythies @ 2019-11-28 22:48 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Peter Zijlstra,
	Vincent Guittot, Quentin Perret, Dietmar Eggemann,
	Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Len Brown, Rafael J . Wysocki

Summary: There never was an issue here.

Sorry for the noise of this thread, and the resulting waste of time.

On 2019.11.26 23:33 Doug Smythies wrote:
> On 2019.11.26 07:20 Giovanni Gherdovich wrote:
>> On Mon, 2019-11-25 at 21:59 -0800, Doug Smythies wrote:
>>> [...]
>>> The issue with the schedutil governor not working properly in the 5.4 RC series
>>> appears to be hardware dependant.

No it 's not.

Issues with my Sandy Bridge, i7-2600K, test computer and kernel 5.4
seem to be because it is running an older Ubuntu server version,
apparently somewhat dependant on cgroup V1 and their cgmanager package.
I am unable to remove the package to test further because I do use VMs
that seem to depend on it.

In the kernel configuration when CONFIG_UCLAMP_TASK_GROUP=y
the computer behaves as though the new parameter "cpu.uclamp.min"
is set to max rather than 0, but I can not prove it.

>>> My test computer is Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz., Sandy Bridge.
>>> On a temporary basis, I acquired a computer with an
>>> Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz, Haswell,
>>> and schedutil governor behaviour with the exact same kernels is fine:

I failed to mention that was brand new Ubuntu server 20.04 development version
installation, which turns out to have been an important omission.

>>> 
>>> That "gitsource" test, "make test" 6 times, first run thrown out:
>>> 
>>> Kernel 5.4 intel_cpufreq/schedutil: 3411.8 seconds
>>> Kernel 5.4 + gg 6 intel_cpufreq/schedutil: 1696.7 seconds
>>> Ratio: 0.49
>>> Recall you got a ratio of 0.49 with 5th generation, Broadwell.

I took the disk from the Haswell computer and put it in the Sandy Bridge
computer, and then everything was fine:

Kernel 5.4 intel_cpufreq/schedutil: 3636 seconds
Kernel 5.4 + gg 6 intel_cpufreq/schedutil: 2427.6 seconds
Ratio: 0.67 which is consistent with previous Sandy Bridge results

...

Those above kernels included:

CONFIG_UCLAMP_TASK_GROUP=y

Which cause my older Ubuntu server edition using the 
intel_cpufreq/schedutil to respond much like the performance
governor, as reported the other day and above.

I also booted the Sandy Bridge computer, with its original disk,
with the kernel command line containing:

cgroup_no_v1=all

and, while it complained during boot, it did work as expected
using the intel_cpufreq/schedutil driver/governor.

... Doug



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/6] x86,sched: Add support for frequency invariance
  2019-11-13 12:46 ` [PATCH v4 1/6] x86,sched: Add support for frequency invariance Giovanni Gherdovich
  2019-11-24  7:49   ` Doug Smythies
@ 2019-12-02 16:34   ` Ionela Voinescu
  2019-12-06 11:57     ` Giovanni Gherdovich
  1 sibling, 1 reply; 19+ messages in thread
From: Ionela Voinescu @ 2019-12-02 16:34 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki,
	x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann, Doug Smythies

Hi Giovanni,

On Wednesday 13 Nov 2019 at 13:46:49 (+0100), Giovanni Gherdovich wrote:
[...]
> ---
>  arch/x86/include/asm/topology.h |  23 ++++++
>  arch/x86/kernel/smpboot.c       | 176 +++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/core.c             |   1 +
>  kernel/sched/sched.h            |   7 ++
>  4 files changed, 206 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 4b14d2318251..9b3aca463c8f 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -193,4 +193,27 @@ static inline void sched_clear_itmt_support(void)
>  }
>  #endif /* CONFIG_SCHED_MC_PRIO */
>  
> +#ifdef CONFIG_SMP
> +#include <asm/cpufeature.h>
> +
> +DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
> +
> +#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)
> +
> +DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
> +
> +static inline long arch_scale_freq_capacity(int cpu)
> +{
> +	if (arch_scale_freq_invariant())
> +		return per_cpu(arch_cpu_freq, cpu);
> +

I see further down in the code that you gate the setting of
arch_cpu_freq by arch_scale_freq_invariant() as well, so it might be
cleaner to remove the condition here and just return the value of the
per_cpu variable. That variable should also have an initial value of
SCHED_FREQ_CAPACITY_SCALE (1024) and if it happens that frequency
invariance is not enabled, then 1024 will always be returned as no code
would have set it to anything else.

Also, arm64 names this cpu variable freq_scale instead of arch_cpu_freq.
It would be nice to have the same name here, to easily understand
similarities in this functionality on both sides.

If arch_cpu_freq seems more complete, you might want to rename it to
arch_cpu_freq_scale, although longer, to clearly state that this is a
scale value and not an absolute frequency value.

> +	return 1024 /* SCHED_CAPACITY_SCALE */;
> +}
> +#define arch_scale_freq_capacity arch_scale_freq_capacity
> +
> +extern void arch_scale_freq_tick(void);
> +#define arch_scale_freq_tick arch_scale_freq_tick
> +
> +#endif
> +
>  #endif /* _ASM_X86_TOPOLOGY_H */
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 69881b2d446c..814d7900779d 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
[...]
> +
> +DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
> +
> +static DEFINE_PER_CPU(u64, arch_prev_aperf);
> +static DEFINE_PER_CPU(u64, arch_prev_mperf);
> +static u64 arch_max_freq = SCHED_CAPACITY_SCALE;
> +

Same here: the scale suffix would make the math below clearer.

[...]
> +static void intel_set_cpu_max_freq(void)
> +{
> +	/*
> +	 * TODO: add support for:
> +	 *
> +	 * - Xeon Gold/Platinum
> +	 * - Xeon Phi (KNM, KNL)
> +	 * - Atom Goldmont
> +	 * - Atom Silvermont
> +	 *
> +	 * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE
> +	 */
> +
> +	static_branch_enable(&arch_scale_freq_key);
> +
> +	if (turbo_disabled() ||
> +		x86_match_cpu(has_skx_turbo_ratio_limits) ||
> +		x86_match_cpu(has_knl_turbo_ratio_limits) ||
> +		x86_match_cpu(has_glm_turbo_ratio_limits))
> +		return;
> +
> +	core_set_cpu_max_freq();
> +}
> +
> +static void init_scale_freq(void *arg)

This function does not initialise the frequency scale factor so the name
is confusing to me. How about init_counters_refs or init_fie_counters_refs
(fie = frequency invariance engine)?

> +{
> +	u64 aperf, mperf;
> +
> +	rdmsrl(MSR_IA32_APERF, aperf);
> +	rdmsrl(MSR_IA32_MPERF, mperf);
> +
> +	this_cpu_write(arch_prev_aperf, aperf);
> +	this_cpu_write(arch_prev_mperf, mperf);
> +}
> +
> +static void set_cpu_max_freq(void)

Similarly for the name of this function: it seems to both set the max
frequency ratio and initialise the references to the aperf and mperf
counters. Also, in the process it enables frequency invariance.
So this function seems to do all the preparation work for frequency
invariance so a more generic name (init_fie/init_frequency_invariance)
would work better in my opinion.

> +{
> +	if (smp_processor_id() != 0 || !boot_cpu_has(X86_FEATURE_APERFMPERF))
> +		return;
> +
> +	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
> +		intel_set_cpu_max_freq();

I see above that you enable the static key (and therefore frequency
invariance before setting the max frequency ratio (if possible) and
before you initialise the counter references. Is there any reason for
doing this?

In my mind the more clear process is:
 - Obtain and set max frequency ratio
 - Initialise counter references
 - If all above goes well enable the static key (and frequency
   invariance)

Thanks,
Ionela.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/6] x86,sched: Add support for frequency invariance
  2019-12-02 16:34   ` Ionela Voinescu
@ 2019-12-06 11:57     ` Giovanni Gherdovich
  0 siblings, 0 replies; 19+ messages in thread
From: Giovanni Gherdovich @ 2019-12-06 11:57 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki,
	x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming,
	Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot,
	Quentin Perret, Dietmar Eggemann, Doug Smythies,
	Giovanni Gherdovich

Hello Ionela,

thanks for your review. Comments below.

> On Wednesday 13 Nov 2019 at 13:46:49 (+0100), Giovanni Gherdovich 
> wrote:
> [...]
> > ---
> >  arch/x86/include/asm/topology.h |  23 ++++++
> >  arch/x86/kernel/smpboot.c       | 176 +++++++++++++++++++++++++++++++++++++++-
> >  kernel/sched/core.c             |   1 +
> >  kernel/sched/sched.h            |   7 ++
> >  4 files changed, 206 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> > index 4b14d2318251..9b3aca463c8f 100644
> > --- a/arch/x86/include/asm/topology.h
> > +++ b/arch/x86/include/asm/topology.h
> > @@ -193,4 +193,27 @@ static inline void sched_clear_itmt_support(void)
> >  }
> >  #endif /* CONFIG_SCHED_MC_PRIO */
> >
> > +#ifdef CONFIG_SMP
> > +#include <asm/cpufeature.h>
> > +
> > +DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
> > +
> > +#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)
> > +
> > +DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
> > +
> > +static inline long arch_scale_freq_capacity(int cpu)
> > +{
> > +	if (arch_scale_freq_invariant())
> > +		return per_cpu(arch_cpu_freq, cpu);
> > +
> 
> I see further down in the code that you gate the setting of
> arch_cpu_freq by arch_scale_freq_invariant() as well, so it might be
> cleaner to remove the condition here and just return the value of the
> per_cpu variable. That variable should also have an initial value of
> SCHED_FREQ_CAPACITY_SCALE (1024) and if it happens that frequency
> invariance is not enabled, then 1024 will always be returned as no code
> would have set it to anything else.

You're correct. Currently I'm not initializing arch_cpu_freq 
explicitely, but
if I set it to 1024 than I can remove the check for 
arch_scale_freq_invariant()
in the function above (arch_scale_freq_capacity) and always return
arch_cpu_freq from there no matter what. Will do that in v5.

> 
> Also, arm64 names this cpu variable freq_scale instead of 
> arch_cpu_freq.
> It would be nice to have the same name here, to easily understand
> similarities in this functionality on both sides.
> 
> If arch_cpu_freq seems more complete, you might want to rename it to
> arch_cpu_freq_scale, although longer, to clearly state that this is a
> scale value and not an absolute frequency value.
> 
> > +	return 1024 /* SCHED_CAPACITY_SCALE */;
> > +}
> > +#define arch_scale_freq_capacity arch_scale_freq_capacity
> > +
> > +extern void arch_scale_freq_tick(void);
> > +#define arch_scale_freq_tick arch_scale_freq_tick
> > +
> > +#endif
> > +
> >  #endif /* _ASM_X86_TOPOLOGY_H */
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 69881b2d446c..814d7900779d 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> [...]
> > +
> > +DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
> > +
> > +static DEFINE_PER_CPU(u64, arch_prev_aperf);
> > +static DEFINE_PER_CPU(u64, arch_prev_mperf);
> > +static u64 arch_max_freq = SCHED_CAPACITY_SCALE;
> > +
> 
> Same here: the scale suffix would make the math below clearer.

Agreed, in v5 will rename the variabled arch_cpu_freq and arch_max_freq 
to
include "scale".

> 
> [...]
> > +static void intel_set_cpu_max_freq(void)
> > +{
> > +	/*
> > +	 * TODO: add support for:
> > +	 *
> > +	 * - Xeon Gold/Platinum
> > +	 * - Xeon Phi (KNM, KNL)
> > +	 * - Atom Goldmont
> > +	 * - Atom Silvermont
> > +	 *
> > +	 * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE
> > +	 */
> > +
> > +	static_branch_enable(&arch_scale_freq_key);
> > +
> > +	if (turbo_disabled() ||
> > +		x86_match_cpu(has_skx_turbo_ratio_limits) ||
> > +		x86_match_cpu(has_knl_turbo_ratio_limits) ||
> > +		x86_match_cpu(has_glm_turbo_ratio_limits))
> > +		return;
> > +
> > +	core_set_cpu_max_freq();
> > +}
> > +
> > +static void init_scale_freq(void *arg)
> 
> This function does not initialise the frequency scale factor so the 
> name
> is confusing to me. How about init_counters_refs or 
> init_fie_counters_refs
> (fie = frequency invariance engine)?

Good point too, I'll rename init_scale_freq() following your suggestion.

> 
> > +{
> > +	u64 aperf, mperf;
> > +
> > +	rdmsrl(MSR_IA32_APERF, aperf);
> > +	rdmsrl(MSR_IA32_MPERF, mperf);
> > +
> > +	this_cpu_write(arch_prev_aperf, aperf);
> > +	this_cpu_write(arch_prev_mperf, mperf);
> > +}
> > +
> > +static void set_cpu_max_freq(void)
> 
> Similarly for the name of this function: it seems to both set the max
> frequency ratio and initialise the references to the aperf and mperf
> counters. Also, in the process it enables frequency invariance.
> So this function seems to do all the preparation work for frequency
> invariance so a more generic name (init_fie/init_frequency_invariance)
> would work better in my opinion.

Yeah I agree, names all around here are not great, will put more thought 
into
it for v5. Thanks for the names suggestions.

> 
> > +{
> > +	if (smp_processor_id() != 0 || !boot_cpu_has(X86_FEATURE_APERFMPERF))
> > +		return;
> > +
> > +	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
> > +		intel_set_cpu_max_freq();
> 
> I see above that you enable the static key (and therefore frequency
> invariance before setting the max frequency ratio (if possible) and
> before you initialise the counter references. Is there any reason for
> doing this?
> 
> In my mind the more clear process is:
>  - Obtain and set max frequency ratio
>  - Initialise counter references
>  - If all above goes well enable the static key (and frequency
>    invariance)

This is a fair point; mine was a deliberate choice but you're the second
person making this remark (Peter Zijlstra also suggested I find the max
frequency before I set the static key), so it appears this design is 
unpopular
enough to warrant a change in v5.

The initialization of counter references doesn't worry me: once you 
check the
presence of the [AM]PERF registers with 
boot_cpu_has(X86_FEATURE_APERFMPERF),
there isn't much that can go wrong when reading their value.

Reading and setting the max frequency ratio before or after enabling 
frequency
invariance is more opinionated. My view is that if you fail to obtain 
the max
frequency ratio from the various MSRs, that means the processor is not 
among
the list of CPUs I've enumerated in the code, but some other model that 
has a
different method (MSR semantics) for gathering the max freq value. Eg: 
if
Intel next year makes a new processor where the max freq is written in 
some
different register (happened plenty of times in the past), and I'm not 
quick
to send a patch for that, that part of the initialization would fail.

What to do at that point? Bail out entirely from frequency invariance? 
In my
opinion that would be too conservative, I prefer to give that machine a
default max freq ratio of 1024 (max frequency == base frequency, i.e. 
pretend
it doesn't have any turbo).

In other words I think that a little frequency invariance is better than 
no
frequency invariance, and the only really essential ingredient is
X86_FEATURE_APERFMPERF to read the current frequency (which I check 
before
setting the static key).

But again: you and one other person have already suggested to do it the 
other
way, and mine is only a mild preference, so I'll change that part in v5.


Thanks,
Giovanni


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, back to index

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-13 12:46 [PATCH v4 0/6] Add support for frequency invariance for (some) x86 Giovanni Gherdovich
2019-11-13 12:46 ` [PATCH v4 1/6] x86,sched: Add support for frequency invariance Giovanni Gherdovich
2019-11-24  7:49   ` Doug Smythies
2019-11-25  8:16     ` Doug Smythies
2019-11-25  9:16     ` Mel Gorman
2019-11-25 16:06     ` Giovanni Gherdovich
2019-11-26  5:59       ` Doug Smythies
2019-11-26 15:20         ` Giovanni Gherdovich
2019-11-27  7:32           ` Doug Smythies
2019-11-28 22:48             ` Doug Smythies
2019-12-02 16:34   ` Ionela Voinescu
2019-12-06 11:57     ` Giovanni Gherdovich
2019-11-13 12:46 ` [PATCH v4 2/6] x86,sched: Add support for frequency invariance on SKYLAKE_X Giovanni Gherdovich
2019-11-13 12:46 ` [PATCH v4 3/6] x86,sched: Add support for frequency invariance on XEON_PHI_KNL/KNM Giovanni Gherdovich
2019-11-13 12:46 ` [PATCH v4 4/6] x86,sched: Add support for frequency invariance on ATOM_GOLDMONT* Giovanni Gherdovich
2019-11-13 12:46 ` [PATCH v4 5/6] x86,sched: Add support for frequency invariance on ATOM Giovanni Gherdovich
2019-11-13 16:50   ` Srinivas Pandruvada
2019-11-15 10:34     ` Giovanni Gherdovich
2019-11-13 12:46 ` [PATCH v4 6/6] x86: intel_pstate: handle runtime turbo disablement/enablement in freq. invariance Giovanni Gherdovich

Linux-PM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-pm/0 linux-pm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-pm linux-pm/ https://lore.kernel.org/linux-pm \
		linux-pm@vger.kernel.org
	public-inbox-index linux-pm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-pm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git