* [PATCH v2 0/2] Add support for frequency invariance for (some) x86

From: Giovanni Gherdovich <ggherdovich@suse.cz>
Date: 2019-10-02 12:29 UTC
To: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
    Borislav Petkov, Len Brown, Rafael J. Wysocki
Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar,
    Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret,
    Dietmar Eggemann, Doug Smythies, Giovanni Gherdovich

v1 at https://lore.kernel.org/lkml/20190909024216.5942-1-ggherdovich@suse.cz/

Changes wrt v1:

- add x86-specific implementation of arch_scale_freq_invariant() using a
  static key that checks for the availability of APERF and MPERF
- refer to GOLDMONT_D instead of GOLDMONT_X, according to recent rename
- set arch_cpu_freq to 1024 from x86_arch_scale_freq_tick_disable() to
  prevent PELT from being fed stale data

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Cover Letter from v1:

This is a resend of Peter Zijlstra's patch to support frequency
scale-invariance on x86 from May 2018 [see 1]. I've added some modifications
and included performance test results. If Peter doesn't mind, I'm slapping my
name on it :)

The changes from Peter's original implementation are:

1) normalizing against the 4-cores turbo level instead of 1-core turbo
2) removing the run-time search for when the above value isn't found in the
   various Intel MSRs -- the base frequency value is taken in that case.

The section "4. KNOWN LIMITATIONS" in the first patch commit message addresses
the reason why this approach was dropped back in 2018, and explains that the
performance gains outweigh that issue.
The second patch from Srinivas is taken verbatim from the May 2018 submission
as it still applies.

I apologize for the length of patch #1's commit message; I've made a table of
contents with summaries of each section that should make it easier to skim
through the content.

This submission incorporates the feedback and requests for additional tests
received during the presentation made at OSPM 2019 in Pisa three months ago.

[1] https://lore.kernel.org/lkml/20180516044911.28797-2-srinivas.pandruvada@linux.intel.com/

Giovanni Gherdovich (1):
  x86,sched: Add support for frequency invariance

Srinivas Pandruvada (1):
  cpufreq: intel_pstate: Conditional frequency invariant accounting

 arch/x86/include/asm/topology.h |  33 +++++++
 arch/x86/kernel/smpboot.c       | 195 +++++++++++++++++++++++++++++++++++++++-
 drivers/cpufreq/intel_pstate.c  |   5 ++
 kernel/sched/core.c             |   1 +
 kernel/sched/sched.h            |   7 ++
 5 files changed, 240 insertions(+), 1 deletion(-)

-- 
2.16.4
* [PATCH v2 1/2] x86,sched: Add support for frequency invariance

From: Giovanni Gherdovich <ggherdovich@suse.cz>
Date: 2019-10-02 12:29 UTC

Implement arch_scale_freq_capacity() for 'modern' x86. This function is used
by the scheduler to correctly account usage in the face of DVFS.

The present patch addresses Intel processors specifically and has positive
performance and performance-per-watt implications for the schedutil cpufreq
governor, bringing it closer to, if not on par with, the powersave governor
from the intel_pstate driver/framework.

Large performance gains are obtained when the machine is lightly loaded and
no regressions are observed at saturation. The benchmarks with the largest
gains are kernel compilation, tbench (the networking version of dbench) and
shell-intensive workloads.

1. FREQUENCY INVARIANCE: MOTIVATION
   * Without it, a task looks larger if the CPU runs slower

2. PECULIARITIES OF X86
   * freq invariance accounting requires knowing the ratio freq_curr/freq_max
   2.1 CURRENT FREQUENCY
       * Use delta_APERF / delta_MPERF * freq_base (a.k.a. "BusyMHz")
   2.2 MAX FREQUENCY
       * It varies with time (turbo). As an approximation, we set it to a
         constant, i.e. the 4-cores turbo frequency (or base frequency if
         nothing else is reported by MSRs)

3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
   * The invariant schedutil formula has no feedback loop and reacts faster
     to utilization changes

4. KNOWN LIMITATIONS
   * In some cases tasks can't reach max util despite how hard they try

5. PERFORMANCE TESTING
   5.1 MACHINES
       * Skylake, Broadwell, Haswell
   5.2 SETUP
       * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max =
         1-2-3-4-8-12 active cores turbo w/ invariant schedutil, and
         intel_pstate/powersave
   5.3 BENCHMARK RESULTS
       5.3.1 NEUTRAL BENCHMARKS
             * NAS Parallel Benchmark (HPC), hackbench
       5.3.2 NON-NEUTRAL BENCHMARKS
             * tbench (10-30% better), kernbench (10-15% better),
               shell-intensive-scripts (30-50% better)
             * no regressions
       5.3.3 SELECTION OF DETAILED RESULTS
       5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
             * dbench (5% worse on one machine), kernbench (3% worse),
               tbench (5-10% better), shell-intensive-scripts (10-40% better)

6. MICROARCH'ES ADDRESSED HERE
   * Xeon Core before the Scalable Performance processors line (Xeon
     Gold/Platinum etc have different MSR semantics for querying turbo
     levels)

7. REFERENCES
   * MMTests performance testing framework, github.com/gormanm/mmtests

+-------------------------------------------------------------------------+
| 1. FREQUENCY INVARIANCE: MOTIVATION
+-------------------------------------------------------------------------+

For example: suppose a CPU has two frequencies, 500 and 1000 MHz. When
running a task that would consume 1/3rd of a CPU at 1000 MHz, it would appear
to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the false
impression this CPU is almost at capacity, even though it can go faster [*].
In a nutshell, without frequency scale-invariance tasks look larger just
because the CPU is running slower.

[*] (footnote: this assumes a linear frequency/performance relation, which
everybody knows to be false, but given realities it's the best approximation
we can make.)

+-------------------------------------------------------------------------+
| 2. PECULIARITIES OF X86
+-------------------------------------------------------------------------+

Accounting for frequency changes in PELT signals requires the computation of
the ratio freq_curr / freq_max. On x86 neither of those terms is readily
available.

2.1 CURRENT FREQUENCY
=====================

Since modern x86 has hardware control over the actual frequency we run at
(because, amongst other things, of Turbo-Mode), we cannot simply use the
frequency as requested through cpufreq. Instead we use the APERF/MPERF MSRs
to compute the effective frequency over the recent past. Also, because
reading MSRs is expensive, don't do so every time we need the value, but
amortize the cost by doing it every tick.

2.2 MAX FREQUENCY
=================

Obtaining freq_max is also non-trivial because at any time the hardware can
provide a frequency boost to a selected subset of cores if the package has
enough power to spare (eg: Turbo Boost). This means that the maximum
frequency available to a given core changes with time.

The approach taken in this change is to arbitrarily set freq_max to a
constant value at boot. The value chosen is the "4-cores (4C) turbo
frequency" on most microarchitectures, after evaluating the following
candidates:

    * 1-core (1C) turbo frequency (the fastest turbo state available)
    * around base frequency (a.k.a. max P-state)
    * something in between, such as 4C turbo

To interpret these options, consider that this is the denominator in
freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
util_avg and load_avg. A large denominator will undershoot (util_avg looks a
bit smaller than it really is); vice versa, with a smaller denominator PELT
signals will tend to overshoot.
Given that PELT drives frequency selection in the schedutil governor, we will
have:

    freq_max set to  |  effect on DVFS
    -----------------+-----------------------------------------------------
    1C turbo         |  power efficiency (lower freq choices)
    base freq        |  performance (higher util_avg, higher freq requests)
    4C turbo         |  a bit of both

4C turbo proves to be a good compromise in a number of benchmarks (see
below).

Note that when the function intel_set_cpu_max_freq() fails to query the
various MSRs for the 4C turbo value, the variable arch_max_freq retains its
default value of SCHED_CAPACITY_SCALE (1024), which corresponds to setting
freq_max to the base frequency wrt the table above.

+-------------------------------------------------------------------------+
| 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
+-------------------------------------------------------------------------+

Once an architecture implements a frequency scale-invariant utilization (the
PELT signal util_avg), schedutil switches its frequency selection formula
from

    freq_next = 1.25 * freq_curr * util        [non-invariant util signal]

to

    freq_next = 1.25 * freq_max * util         [invariant util signal]

where, in the second formula, freq_max is set to the 1C turbo frequency (max
turbo). The advantage of the second formula, whose usage we unlock with this
patch, is that freq_next doesn't depend on the current frequency in an
iterative fashion, but can jump to any frequency in a single update. This
absence of feedback in the formula makes it quicker to react to utilization
changes and more robust against pathological instabilities.

Compare it to the update formula of intel_pstate/powersave:

    freq_next = 1.25 * freq_max * Busy%

where again freq_max is 1C turbo and Busy% is the percentage of time not
spent idling (calculated with delta_MPERF / delta_TSC); essentially the same
as invariant schedutil, and largely responsible for
intel_pstate/powersave's good reputation.
The non-invariant schedutil formula is derived from the invariant one by
approximating util_inv with util_raw * freq_curr / freq_max, but this has
limitations.

Testing shows improved performance due to better frequency selections when
the machine is lightly loaded, and essentially no change in behaviour at
saturation / overutilization.

+-------------------------------------------------------------------------+
| 4. KNOWN LIMITATIONS
+-------------------------------------------------------------------------+

It's been shown that it is possible to create pathological scenarios where a
CPU-bound task cannot reach max utilization, if the normalizing factor
freq_max is fixed to a constant value (see [Lelli-2018]). If freq_max is set
to 4C turbo as we do here, one needs to peg at least 5 cores in a package
doing some busywork, and observe that none of those tasks will ever reach
max util (1024) because they're all running at less than the 4C turbo
frequency.

While this concern still applies, we believe the performance benefit of
frequency scale-invariant PELT signals outweighs the cost of this limitation.

[Lelli-2018]
https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/

+-------------------------------------------------------------------------+
| 5. PERFORMANCE TESTING
+-------------------------------------------------------------------------+

5.1 MACHINES
============

We tested the patch on three machines, with Skylake, Broadwell and Haswell
CPUs. The details are below, together with the available turbo ratios as
reported by the appropriate MSRs.
* 8x-SKYLAKE-UMA:
  Single socket E3-1240 v5, Skylake 4 cores/8 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

     EFFIC    800 |********
     BASE    3500 |***********************************
     4C      3700 |*************************************
     3C      3800 |**************************************
     2C      3900 |***************************************
     1C      3900 |***************************************

* 80x-BROADWELL-NUMA:
  Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

     EFFIC   1200 |************
     BASE    2200 |**********************
     8C      2900 |*****************************
     7C      3000 |******************************
     6C      3100 |*******************************
     5C      3200 |********************************
     4C      3300 |*********************************
     3C      3400 |**********************************
     2C      3600 |************************************
     1C      3600 |************************************

* 48x-HASWELL-NUMA:
  Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

     EFFIC   1200 |************
     BASE    2300 |***********************
     12C     2600 |**************************
     11C     2600 |**************************
     10C     2600 |**************************
     9C      2600 |**************************
     8C      2600 |**************************
     7C      2600 |**************************
     6C      2600 |**************************
     5C      2700 |***************************
     4C      2800 |****************************
     3C      2900 |*****************************
     2C      3100 |*******************************
     1C      3100 |*******************************

5.2 SETUP
=========

* The baseline is Linux v5.2 with schedutil (non-invariant) and the
  intel_pstate driver in passive mode.
* The rationale for choosing the various freq_max values to test has been to
  try all of the 1-2-3-4C turbo levels (note that 1C and 2C turbo are
  identical on all machines), plus one more value closer to base_freq but
  still in the turbo range (8C turbo for both 80x-BROADWELL-NUMA and
  48x-HASWELL-NUMA).

* In addition we've run all tests with intel_pstate/powersave for comparison.

* The filesystem is always XFS, the userspace is openSUSE Leap 15.1.

* 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
  with active intel_pstate on this machine use that.

This gives, in terms of combinations tested on each machine:

* 8x-SKYLAKE-UMA
  * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
  * intel_pstate active + powersave + HWP
  * invariant schedutil, freq_max = 1C turbo
  * invariant schedutil, freq_max = 3C turbo
  * invariant schedutil, freq_max = 4C turbo

* both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
  * [same as 8x-SKYLAKE-UMA, but not HWP-capable]
  * invariant schedutil, freq_max = 8C turbo
  * (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores
    turbo")

5.3 BENCHMARK RESULTS
=====================

5.3.1 NEUTRAL BENCHMARKS
------------------------

Tests that didn't show any measurable difference in performance on any of
the test machines between non-invariant schedutil and our patch are:

* NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
  computational kernel
* flexible I/O (FIO)
* hackbench (using threads or processes, and using pipes or sockets)

5.3.2 NON-NEUTRAL BENCHMARKS
----------------------------

What follow are summary tables where each benchmark result is given a score.

* A tilde (~) means a neutral result, i.e. no difference from baseline.
* Scores are computed with the ratio result_new / result_baseline, so a
  tilde means a score of 1.00.
* The results in the score ratio are the geometric means of results running
  the benchmark with different parameters (eg: for kernbench, using 1, 2, 4,
  ... processes; for pgbench, varying the number of clients, and so on).
* The first three tables show higher-is-better kinds of tests (i.e. measured
  in operations/second), the subsequent three show lower-is-better kinds of
  tests (i.e. the workload is fixed and we measure elapsed time, think
  kernbench).
* "gitsource" is a name we made up for the test consisting of running the
  entire unit test suite of the Git SCM and measuring how long it takes. We
  take it as a typical example of a shell-intensive serialized workload.
* In the "I_PSTATE" column we have the results for intel_pstate/powersave.
  Other columns show invariant schedutil for different values of freq_max.
  4C turbo is circled as it's the value we've chosen for the final
  implementation.

80x-BROADWELL-NUMA (comparison ratio; higher is better)
                                        +------+
             I_PSTATE    1C      3C     |  4C  |    8C
pgbench-ro     1.14       ~       ~     | 1.11 |   1.14
pgbench-rw      ~         ~       ~     |   ~  |    ~
netperf-udp    1.06       ~      1.06   | 1.05 |   1.07
netperf-tcp     ~        1.03     ~     | 1.01 |   1.02
tbench4        1.57      1.18    1.22   | 1.30 |   1.56
                                        +------+

8x-SKYLAKE-UMA (comparison ratio; higher is better)
                                            +------+
             I_PSTATE/HWP    1C      3C     |  4C  |
pgbench-ro       ~            ~       ~     |   ~  |
pgbench-rw       ~            ~       ~     |   ~  |
netperf-udp      ~            ~       ~     |   ~  |
netperf-tcp      ~            ~       ~     |   ~  |
tbench4         1.30         1.14    1.14   | 1.16 |
                                            +------+

48x-HASWELL-NUMA (comparison ratio; higher is better)
                                        +------+
             I_PSTATE    1C      3C     |  4C  |   12C
pgbench-ro     1.15       ~       ~     | 1.06 |   1.16
pgbench-rw      ~         ~       ~     |   ~  |    ~
netperf-udp    1.05      0.97    1.04   | 1.04 |   1.02
netperf-tcp    0.96      1.01    1.01   | 1.01 |   1.01
tbench4        1.50      1.05    1.13   | 1.13 |   1.25
                                        +------+

In the table above we see that active intel_pstate is slightly better than
our 4C-turbo patch (both in reference to the baseline non-invariant
schedutil) on read-only pgbench and much better on tbench. In both cases it
is notable that lowering our freq_max (to 8C turbo and 12C turbo on
80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant
schedutil get closer.
If we ignore active intel_pstate and focus on the comparison with baseline
alone, there are several instances of double-digit performance improvement.

80x-BROADWELL-NUMA (comparison ratio; lower is better)
                                        +------+
             I_PSTATE    1C      3C     |  4C  |    8C
dbench4        1.23      0.95    0.95   | 0.95 |   0.95
kernbench      0.93      0.83    0.83   | 0.83 |   0.82
gitsource      0.98      0.49    0.49   | 0.49 |   0.48
                                        +------+

8x-SKYLAKE-UMA (comparison ratio; lower is better)
                                            +------+
             I_PSTATE/HWP    1C      3C     |  4C  |
dbench4          ~            ~       ~     |   ~  |
kernbench        ~            ~       ~     |   ~  |
gitsource       0.92         0.55    0.55   | 0.55 |
                                            +------+

48x-HASWELL-NUMA (comparison ratio; lower is better)
                                        +------+
             I_PSTATE    1C      3C     |  4C  |    8C
dbench4         ~         ~       ~     |   ~  |    ~
kernbench      0.94      0.90    0.89   | 0.90 |   0.90
gitsource      0.97      0.69    0.69   | 0.69 |   0.69
                                        +------+

dbench is not very remarkable here, unless we notice how poorly active
intel_pstate is performing on 80x-BROADWELL-NUMA: a 23% regression versus
non-invariant schedutil. We repeated that run getting consistent results.
This is out of scope for the patch at hand, but deserves future
investigation. Other than that, we previously ran this campaign with Linux
v5.0 and saw the patch doing better on dbench at the time. We haven't
checked closely and can only speculate at this point.

On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see
in the detailed tables that the gains concentrate on low process counts
(lightly loaded machines).

The test we call "gitsource" (running the git unit test suite, a
long-running single-threaded shell script) appears rather spectacular in
this table (gains of 30-50% depending on the machine). It is to be noted,
however, that gitsource has no adjustable parameters (such as the number of
jobs in kernbench, which we average over in order to get a single-number
summary score) and is exactly the kind of low-parallelism workload that
benefits the most from this patch. When looking at the detailed tables of
kernbench or tbench4, at low process or client counts one can see similar
numbers.
5.3.3 SELECTION OF DETAILED RESULTS
-----------------------------------

Machine           : 48x-HASWELL-NUMA
Benchmark         : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter : number of clients
Unit              : MB/sec (higher is better)

            5.2.0 vanilla (BASELINE)        5.2.0 intel_pstate             5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1     126.73 +- 0.31% (        )   315.91 +- 0.66% ( 149.28%)   125.03 +- 0.76% (  -1.34%)
Hmean  2     258.04 +- 0.62% (        )   614.16 +- 0.51% ( 138.01%)   269.58 +- 1.45% (   4.47%)
Hmean  4     514.30 +- 0.67% (        )  1146.58 +- 0.54% ( 122.94%)   533.84 +- 1.99% (   3.80%)
Hmean  8    1111.38 +- 2.52% (        )  2159.78 +- 0.38% (  94.33%)  1359.92 +- 1.56% (  22.36%)
Hmean  16   2286.47 +- 1.36% (        )  3338.29 +- 0.21% (  46.00%)  2720.20 +- 0.52% (  18.97%)
Hmean  32   4704.84 +- 0.35% (        )  4759.03 +- 0.43% (   1.15%)  4774.48 +- 0.30% (   1.48%)
Hmean  64   7578.04 +- 0.27% (        )  7533.70 +- 0.43% (  -0.59%)  7462.17 +- 0.65% (  -1.53%)
Hmean  128  6998.52 +- 0.16% (        )  6987.59 +- 0.12% (  -0.16%)  6909.17 +- 0.14% (  -1.28%)
Hmean  192  6901.35 +- 0.25% (        )  6913.16 +- 0.10% (   0.17%)  6855.47 +- 0.21% (  -0.66%)

            5.2.0 3C-turbo                  5.2.0 4C-turbo                 5.2.0 12C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1     128.43 +- 0.28% (   1.34%)   130.64 +- 3.81% (   3.09%)   153.71 +- 5.89% (  21.30%)
Hmean  2     311.70 +- 6.15% (  20.79%)   281.66 +- 3.40% (   9.15%)   305.08 +- 5.70% (  18.23%)
Hmean  4     641.98 +- 2.32% (  24.83%)   623.88 +- 5.28% (  21.31%)   906.84 +- 4.65% (  76.32%)
Hmean  8    1633.31 +- 1.56% (  46.96%)  1714.16 +- 0.93% (  54.24%)  2095.74 +- 0.47% (  88.57%)
Hmean  16   3047.24 +- 0.42% (  33.27%)  3155.02 +- 0.30% (  37.99%)  3634.58 +- 0.15% (  58.96%)
Hmean  32   4734.31 +- 0.60% (   0.63%)  4804.38 +- 0.23% (   2.12%)  4674.62 +- 0.27% (  -0.64%)
Hmean  64   7699.74 +- 0.35% (   1.61%)  7499.72 +- 0.34% (  -1.03%)  7659.03 +- 0.25% (   1.07%)
Hmean  128  6935.18 +- 0.15% (  -0.91%)  6942.54 +- 0.10% (  -0.80%)  7004.85 +- 0.12% (   0.09%)
Hmean  192  6901.62 +- 0.12% (   0.00%)  6856.93 +- 0.10% (  -0.64%)  6978.74 +- 0.10% (   1.12%)

This is one of the cases where the patch still can't surpass active
intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise,
gains are visible up to 16 clients and the saturated scenario is the same as
baseline. The scores in the summary table from the previous sections are
ratios of geometric means of the results over different clients, as seen in
this table.

Machine           : 80x-BROADWELL-NUMA
Benchmark         : kernbench (kernel compilation)
Varying parameter : number of jobs
Unit              : seconds (lower is better)

            5.2.0 vanilla (BASELINE)        5.2.0 intel_pstate             5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2     379.68 +- 0.06% (        )   330.20 +- 0.43% (  13.03%)   285.93 +- 0.07% (  24.69%)
Amean  4     200.15 +- 0.24% (        )   175.89 +- 0.22% (  12.12%)   153.78 +- 0.25% (  23.17%)
Amean  8     106.20 +- 0.31% (        )    95.54 +- 0.23% (  10.03%)    86.74 +- 0.10% (  18.32%)
Amean  16     56.96 +- 1.31% (        )    53.25 +- 1.22% (   6.50%)    48.34 +- 1.73% (  15.13%)
Amean  32     34.80 +- 2.46% (        )    33.81 +- 0.77% (   2.83%)    30.28 +- 1.59% (  12.99%)
Amean  64     26.11 +- 1.63% (        )    25.04 +- 1.07% (   4.10%)    22.41 +- 2.37% (  14.16%)
Amean  128    24.80 +- 1.36% (        )    23.57 +- 1.23% (   4.93%)    21.44 +- 1.37% (  13.55%)
Amean  160    24.85 +- 0.56% (        )    23.85 +- 1.17% (   4.06%)    21.25 +- 1.12% (  14.49%)

            5.2.0 3C-turbo                  5.2.0 4C-turbo                 5.2.0 8C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2     284.08 +- 0.13% (  25.18%)   283.96 +- 0.51% (  25.21%)   285.05 +- 0.21% (  24.92%)
Amean  4     153.18 +- 0.22% (  23.47%)   154.70 +- 1.64% (  22.71%)   153.64 +- 0.30% (  23.24%)
Amean  8      87.06 +- 0.28% (  18.02%)    86.77 +- 0.46% (  18.29%)    86.78 +- 0.22% (  18.28%)
Amean  16     48.03 +- 0.93% (  15.68%)    47.75 +- 1.99% (  16.17%)    47.52 +- 1.61% (  16.57%)
Amean  32     30.23 +- 1.20% (  13.14%)    30.08 +- 1.67% (  13.57%)    30.07 +- 1.67% (  13.60%)
Amean  64     22.59 +- 2.02% (  13.50%)    22.63 +- 0.81% (  13.32%)    22.42 +- 0.76% (  14.12%)
Amean  128    21.37 +- 0.67% (  13.82%)    21.31 +- 1.15% (  14.07%)    21.17 +- 1.93% (  14.63%)
Amean  160    21.68 +- 0.57% (  12.76%)    21.18 +- 1.74% (  14.77%)    21.22 +- 1.00% (  14.61%)

The patch outperforms active intel_pstate (and baseline) by a considerable
margin; the summary table from the previous section says 4C turbo and active
intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
noticeable difference with regard to the value of freq_max.

Machine           : 8x-SKYLAKE-UMA
Benchmark         : gitsource (time to run the git unit test suite)
Varying parameter : none
Unit              : seconds (lower is better)

            5.2.0 vanilla                   5.2.0 intel_pstate/hwp         5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean        858.85 +- 1.16% (        )   791.94 +- 0.21% (   7.79%)   474.95          (  44.70%)

            5.2.0 3C-turbo                  5.2.0 4C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean        475.26 +- 0.20% (  44.66%)   474.34 +- 0.13% (  44.77%)

In this test, which is of interest as it represents shell-intensive (i.e.
fork-intensive) serialized workloads, invariant schedutil outperforms
intel_pstate/powersave by a whopping 40% margin.

5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
---------------------------------------------

The following table shows average power consumption in watts for each
benchmark. Data comes from turbostat (package average), which in turn is
read from the RAPL interface on CPUs. We know the patch affects CPU
frequencies so it's reasonable to ignore other power consumers (such as
memory or I/O). Also, we don't have a power meter available in the lab so
RAPL is the best we have.

turbostat sampled average power every 10 seconds for the entire duration of
each benchmark. We took all those values and averaged them (i.e. we don't
have detail at a per-parameter granularity, only on whole benchmarks).

80x-BROADWELL-NUMA (power consumption, watts)
                                                    +--------+
             BASELINE  I_PSTATE     1C       3C     |   4C   |     8C
pgbench-ro    130.01    142.77    131.11   132.45   | 134.65 |   136.84
pgbench-rw     68.30     60.83     71.45    71.70   |  71.65 |    72.54
dbench4        90.25     59.06    101.43    99.89   | 101.10 |   102.94
netperf-udp    65.70     69.81     66.02    68.03   |  68.27 |    68.95
netperf-tcp    88.08     87.96     88.97    88.89   |  88.85 |    88.20
tbench4       142.32    176.73    153.02   163.91   | 165.58 |   176.07
kernbench      92.94    101.95    114.91   115.47   | 115.52 |   115.10
gitsource      40.92     41.87     75.14    75.20   |  75.40 |    75.70
                                                    +--------+

8x-SKYLAKE-UMA (power consumption, watts)
                                                        +--------+
             BASELINE  I_PSTATE/HWP     1C       3C     |   4C   |
pgbench-ro     46.49      46.68       46.56    46.59    |  46.52 |
pgbench-rw     29.34      31.38       30.98    31.00    |  31.00 |
dbench4        27.28      27.37       27.49    27.41    |  27.38 |
netperf-udp    22.33      22.41       22.36    22.35    |  22.36 |
netperf-tcp    27.29      27.29       27.30    27.31    |  27.33 |
tbench4        41.13      45.61       43.10    43.33    |  43.56 |
kernbench      42.56      42.63       43.01    43.01    |  43.01 |
gitsource      13.32      13.69       17.33    17.30    |  17.35 |
                                                        +--------+

48x-HASWELL-NUMA (power consumption, watts)
                                                    +--------+
             BASELINE  I_PSTATE     1C       3C     |   4C   |    12C
pgbench-ro    128.84    136.04    129.87   132.43   | 132.30 |   134.86
pgbench-rw     37.68     37.92     37.17    37.74   |  37.73 |    37.31
dbench4        28.56     28.73     28.60    28.73   |  28.70 |    28.79
netperf-udp    56.70     60.44     56.79    57.42   |  57.54 |    57.52
netperf-tcp    75.49     75.27     75.87    76.02   |  76.01 |    75.95
tbench4       115.44    139.51    119.53   123.07   | 123.97 |   130.22
kernbench      83.23     91.55     95.58    95.69   |  95.72 |    96.04
gitsource      36.79     36.99     39.99    40.34   |  40.35 |    40.23
                                                    +--------+

A lower power consumption isn't necessarily better; it depends on what is
done with that energy. Here are tables with the ratio of
performance-per-watt on each machine and benchmark. Higher is always better;
a tilde (~) means a neutral ratio (i.e. 1.00).
80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 8C pgbench-ro 1.04 1.06 0.94 | 1.07 | 1.08 pgbench-rw 1.10 0.97 0.96 | 0.96 | 0.97 dbench4 1.24 0.94 0.95 | 0.94 | 0.92 netperf-udp ~ 1.02 1.02 | ~ | 1.02 netperf-tcp ~ 1.02 ~ | ~ | 1.02 tbench4 1.26 1.10 1.06 | 1.12 | 1.26 kernbench 0.98 0.97 0.97 | 0.97 | 0.98 gitsource ~ 1.11 1.11 | 1.11 | 1.13 +------+ 8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE/HWP 1C 3C | 4C | pgbench-ro ~ ~ ~ | ~ | pgbench-rw 0.95 0.97 0.96 | 0.96 | dbench4 ~ ~ ~ | ~ | netperf-udp ~ ~ ~ | ~ | netperf-tcp ~ ~ ~ | ~ | tbench4 1.17 1.09 1.08 | 1.10 | kernbench ~ ~ ~ | ~ | gitsource 1.06 1.40 1.40 | 1.40 | +------+ 48x-HASWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 12C pgbench-ro 1.09 ~ 1.09 | 1.03 | 1.11 pgbench-rw ~ 0.86 ~ | ~ | 0.86 dbench4 ~ 1.02 1.02 | 1.02 | ~ netperf-udp ~ 0.97 1.03 | 1.02 | ~ netperf-tcp 0.96 ~ ~ | ~ | ~ tbench4 1.24 ~ 1.06 | 1.05 | 1.11 kernbench 0.97 0.97 0.98 | 0.97 | 0.96 gitsource 1.03 1.33 1.32 | 1.32 | 1.33 +------+ These results are overall pleasing: in plenty of cases we observe performance-per-watt improvements. The few regressions (read/write pgbench and dbench on the Broadwell machine) are of small magnitude. kernbench loses a few percentage points (it has a 10-15% performance improvement, but apparently the increase in power consumption is larger than that). tbench4 and gitsource, which benefit the most from the patch, keep a positive score in this table which is a welcome surprise; that suggests that in those particular workloads the non-invariant schedutil (and active intel_pstate, too) makes some rather suboptimal frequency selections. +-------------------------------------------------------------------------+ | 6. 
MICROARCH'ES ADDRESSED HERE +-------------------------------------------------------------------------+ The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies respectively. This excludes the recent Xeon Scalable Performance processors line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently. Subsequent patches will address: * Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus * Xeon Phi (Knights Landing, Knights Mill) * Atom Silvermont +-------------------------------------------------------------------------+ | 7. REFERENCES +-------------------------------------------------------------------------+ Tests have been run with the help of the MMTests performance testing framework, see github.com/gormanm/mmtests. The configuration file names for the benchmark used are: db-pgbench-timed-ro-small-xfs db-pgbench-timed-rw-small-xfs io-dbench4-async-xfs network-netperf-unbound network-tbench scheduler-unbound workload-kerndevel-xfs workload-shellscripts-xfs hpc-nas-c-class-mpi-full-xfs hpc-nas-c-class-omp-full All those benchmarks are generally available on the web: pgbench: https://www.postgresql.org/docs/10/pgbench.html netperf: https://hewlettpackard.github.io/netperf/ dbench/tbench: https://dbench.samba.org/ gitsource: git unit test suite, github.com/git/git NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Acked-by: Doug Smythies <dsmythies@telus.net> --- arch/x86/include/asm/topology.h | 33 +++++++ arch/x86/kernel/smpboot.c | 195 +++++++++++++++++++++++++++++++++++++++- kernel/sched/core.c | 1 + kernel/sched/sched.h | 7 ++ 4 files changed, 235 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/topology.h 
b/arch/x86/include/asm/topology.h index 4b14d2318251..260410306879 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -193,4 +193,37 @@ static inline void sched_clear_itmt_support(void) } #endif /* CONFIG_SCHED_MC_PRIO */ +#ifdef CONFIG_SMP +#include <asm/cpufeature.h> + +DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key); + +#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key) + +DECLARE_PER_CPU(unsigned long, arch_cpu_freq); + +static inline long arch_scale_freq_capacity(int cpu) +{ + if (arch_scale_freq_invariant()) + return per_cpu(arch_cpu_freq, cpu); + + return 1024 /* SCHED_CAPACITY_SCALE */; +} +#define arch_scale_freq_capacity arch_scale_freq_capacity + +extern void arch_scale_freq_tick(void); +#define arch_scale_freq_tick arch_scale_freq_tick + +extern void x86_arch_scale_freq_tick_enable(void); +extern void x86_arch_scale_freq_tick_disable(void); +#else +static inline void x86_arch_scale_freq_tick_enable(void) +{ +} + +static inline void x86_arch_scale_freq_tick_disable(void) +{ +} +#endif + #endif /* _ASM_X86_TOPOLOGY_H */ diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 69881b2d446c..7aa72a51e1e9 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -147,6 +147,8 @@ static inline void smpboot_restore_warm_reset_vector(void) *((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0; } +static void set_cpu_max_freq(void); + /* * Report back to the Boot Processor during boot time or to the caller processor * during CPU online. @@ -183,6 +185,8 @@ static void smp_callin(void) */ set_cpu_sibling_map(raw_smp_processor_id()); + set_cpu_max_freq(); + /* * Get our bogomips. * Update loops_per_jiffy in cpu_data. 
Previous call to @@ -1337,7 +1341,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus) set_sched_topology(x86_topology); set_cpu_sibling_map(0); - + set_cpu_max_freq(); smp_sanity_check(); switch (apic_intr_mode) { @@ -1764,3 +1768,192 @@ void native_play_dead(void) } #endif + +/* + * APERF/MPERF frequency ratio computation. + * + * The scheduler wants to do frequency invariant accounting and needs a <1 + * ratio to account for the 'current' frequency, corresponding to + * freq_curr / freq_max. + * + * Since the frequency freq_curr on x86 is controlled by a micro-controller and + * our P-state setting is little more than a request/hint, we need to observe + * the effective frequency 'BusyMHz', i.e. the average frequency over a time + * interval after discarding idle time. This is given by: + * + * BusyMHz = delta_APERF / delta_MPERF * freq_base + * + * where freq_base is the max non-turbo P-state. + * + * The freq_max term has to be set to a somewhat arbitrary value, because we + * can't know which turbo states will be available at a given point in time: + * it all depends on the thermal headroom of the entire package. We set it to + * the turbo level with 4 cores active. + * + * Benchmarks show that's a good compromise between the 1C turbo ratio + * (freq_curr/freq_max would rarely reach 1) and something close to freq_base, + * which would ignore the entire turbo range (a conspicuous part, making + * freq_curr/freq_max always maxed out). + * + * Setting freq_max to anything less than the 1C turbo ratio makes the ratio + * freq_curr / freq_max eventually grow >1, in which case we clip it to 1.
+ */ + +DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key); + +static DEFINE_PER_CPU(u64, arch_prev_aperf); +static DEFINE_PER_CPU(u64, arch_prev_mperf); +static u64 arch_max_freq = SCHED_CAPACITY_SCALE; + +static bool turbo_disabled(void) +{ + u64 misc_en; + int err; + + err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en); + if (err) + return false; + + return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE); +} + +#include <asm/cpu_device_id.h> +#include <asm/intel-family.h> + +#define ICPU(model) \ + { X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0} + +static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = { + ICPU(INTEL_FAM6_XEON_PHI_KNL), + ICPU(INTEL_FAM6_XEON_PHI_KNM), + {} +}; + +static const struct x86_cpu_id has_turbo_ratio_group_limits[] = { + ICPU(INTEL_FAM6_ATOM_GOLDMONT), + ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), + ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS), + ICPU(INTEL_FAM6_SKYLAKE_X), + {} +}; + +static void core_set_cpu_max_freq(void) +{ + u64 ratio, turbo_ratio; + int err; + + if (smp_processor_id() != 0) + return; + + if (turbo_disabled() || + x86_match_cpu(has_knl_turbo_ratio_limits) || + x86_match_cpu(has_turbo_ratio_group_limits)) + return; + + err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio); + if (err) + return; + + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio); + if (err) + return; + + ratio = (ratio >> 8) & 0xFF; /* max P state ratio */ + turbo_ratio = (turbo_ratio >> 24) & 0xFF; /* 4C turbo ratio */ + + arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio); + + static_branch_enable(&arch_scale_freq_key); +} + +static void intel_set_cpu_max_freq(void) +{ + /* + * TODO: add support for: + * + * - Xeon Phi (KNM, KNL) + * - Xeon Gold/Platinum, Atom Goldmont/Goldmont Plus + * - Atom Silvermont + * + * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE + */ + core_set_cpu_max_freq(); +} + +static void init_scale_freq(void *arg) +{ + u64 aperf, mperf; + + rdmsrl(MSR_IA32_APERF, aperf); + rdmsrl(MSR_IA32_MPERF, mperf); + 
+ this_cpu_write(arch_prev_aperf, aperf); + this_cpu_write(arch_prev_mperf, mperf); +} + +static void set_cpu_max_freq(void) +{ + if (!boot_cpu_has(X86_FEATURE_APERFMPERF)) + return; + + switch (boot_cpu_data.x86_vendor) { + case X86_VENDOR_INTEL: + intel_set_cpu_max_freq(); + break; + default: + break; + } + + init_scale_freq(NULL); +} + +DEFINE_PER_CPU(unsigned long, arch_cpu_freq); + +static bool tick_disable; + +void arch_scale_freq_tick(void) +{ + u64 freq; + u64 aperf, mperf; + u64 acnt, mcnt; + + if (!arch_scale_freq_invariant() || tick_disable) + return; + + rdmsrl(MSR_IA32_APERF, aperf); + rdmsrl(MSR_IA32_MPERF, mperf); + + acnt = aperf - this_cpu_read(arch_prev_aperf); + mcnt = mperf - this_cpu_read(arch_prev_mperf); + if (!mcnt) + return; + + this_cpu_write(arch_prev_aperf, aperf); + this_cpu_write(arch_prev_mperf, mperf); + + acnt <<= 2*SCHED_CAPACITY_SHIFT; + mcnt *= arch_max_freq; + + freq = div64_u64(acnt, mcnt); + + if (freq > SCHED_CAPACITY_SCALE) + freq = SCHED_CAPACITY_SCALE; + + this_cpu_write(arch_cpu_freq, freq); +} + +void x86_arch_scale_freq_tick_enable(void) +{ + tick_disable = false; +} + +static void reset_scale_freq(void *arg) +{ + this_cpu_write(arch_cpu_freq, SCHED_CAPACITY_SCALE); +} + +void x86_arch_scale_freq_tick_disable(void) +{ + on_each_cpu(reset_scale_freq, NULL, 1); + tick_disable = true; +} diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7880f4f64d0e..2bdce4a140ae 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3593,6 +3593,7 @@ void scheduler_tick(void) struct task_struct *curr = rq->curr; struct rq_flags rf; + arch_scale_freq_tick(); sched_clock_tick(); rq_lock(rq, &rf); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0db2c1b3361e..0fe4f2dcd175 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1953,6 +1953,13 @@ static inline int hrtick_enabled(struct rq *rq) #endif /* CONFIG_SCHED_HRTICK */ +#ifndef arch_scale_freq_tick +static __always_inline +void 
arch_scale_freq_tick(void) +{ +} +#endif + #ifndef arch_scale_freq_capacity static __always_inline unsigned long arch_scale_freq_capacity(int cpu) -- 2.16.4 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v2 1/2] x86,sched: Add support for frequency invariance 2019-10-02 12:29 ` [PATCH v2 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich @ 2019-10-02 15:23 ` kbuild test robot 2019-10-02 15:49 ` Giovanni Gherdovich 2019-10-02 16:43 ` kbuild test robot ` (2 subsequent siblings) 3 siblings, 1 reply; 25+ messages in thread From: kbuild test robot @ 2019-10-02 15:23 UTC (permalink / raw) To: Giovanni Gherdovich Cc: kbuild-all, Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki, x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies, Giovanni Gherdovich [-- Attachment #1: Type: text/plain, Size: 1684 bytes --] Hi Giovanni, Thank you for the patch! Yet something to improve: [auto build test ERROR on tip/sched/core] [cannot apply to v5.4-rc1 next-20191002] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system. BTW, we also suggest to use '--base' option to specify the base tree in git format-patch, please see https://stackoverflow.com/a/37406982] url: https://github.com/0day-ci/linux/commits/Giovanni-Gherdovich/Add-support-for-frequency-invariance-for-some-x86/20191002-221807 config: x86_64-defconfig (attached as .config) compiler: gcc-7 (Debian 7.4.0-13) 7.4.0 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 If you fix the issue, kindly add following tag Reported-by: kbuild test robot <lkp@intel.com> All errors (new ones prefixed by >>): >> arch/x86/kernel/smpboot.c:1834:7: error: 'INTEL_FAM6_ATOM_GOLDMONT_D' undeclared here (not in a function); did you mean 'INTEL_FAM6_ATOM_GOLDMONT_X'? 
ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), ^ arch/x86/kernel/smpboot.c:1824:25: note: in definition of macro 'ICPU' { X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0} ^~~~~ vim +1834 arch/x86/kernel/smpboot.c 1831 1832 static const struct x86_cpu_id has_turbo_ratio_group_limits[] = { 1833 ICPU(INTEL_FAM6_ATOM_GOLDMONT), > 1834 ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), 1835 ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS), 1836 ICPU(INTEL_FAM6_SKYLAKE_X), 1837 {} 1838 }; 1839 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 28591 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 1/2] x86,sched: Add support for frequency invariance 2019-10-02 15:23 ` kbuild test robot @ 2019-10-02 15:49 ` Giovanni Gherdovich 0 siblings, 0 replies; 25+ messages in thread From: Giovanni Gherdovich @ 2019-10-02 15:49 UTC (permalink / raw) To: kbuild test robot Cc: kbuild-all, Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki, x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Wed, 2019-10-02 at 23:23 +0800, kbuild test robot wrote: > Hi Giovanni, > > Thank you for the patch! Yet something to improve: > > [auto build test ERROR on tip/sched/core] > [cannot apply to v5.4-rc1 next-20191002] > [if your patch is applied to the wrong git tree, please drop us a note to help > improve the system. BTW, we also suggest to use '--base' option to specify the > base tree in git format-patch, please see https://stackoverflow.com/a/37406982] Noted, thanks. Indeed this patch applies to the master branch of the "tip" tree, git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git , and depends on 5ebb34edbefa8 "x86/intel: Aggregate microserver naming". I'll use '--base' in the future. Giovanni > > url: https://github.com/0day-ci/linux/commits/Giovanni-Gherdovich/Add-support-for-frequency-invariance-for-some-x86/20191002-221807 > config: x86_64-defconfig (attached as .config) > compiler: gcc-7 (Debian 7.4.0-13) 7.4.0 > reproduce: > # save the attached .config to linux build tree > make ARCH=x86_64 > > If you fix the issue, kindly add following tag > Reported-by: kbuild test robot <lkp@intel.com> > > All errors (new ones prefixed by >>): > > > > arch/x86/kernel/smpboot.c:1834:7: error: 'INTEL_FAM6_ATOM_GOLDMONT_D' undeclared here (not in a function); did you mean 'INTEL_FAM6_ATOM_GOLDMONT_X'? 
> > ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), > ^ > arch/x86/kernel/smpboot.c:1824:25: note: in definition of macro 'ICPU' > { X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0} > ^~~~~ > > vim +1834 arch/x86/kernel/smpboot.c > > 1831 > 1832 static const struct x86_cpu_id has_turbo_ratio_group_limits[] = { > 1833 ICPU(INTEL_FAM6_ATOM_GOLDMONT), > > 1834 ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), > > 1835 ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS), > 1836 ICPU(INTEL_FAM6_SKYLAKE_X), > 1837 {} > 1838 }; > 1839 > > --- > 0-DAY kernel test infrastructure Open Source Technology Center > https://lists.01.org/pipermail/kbuild-all Intel Corporation ^ permalink raw reply [flat|nested] 25+ messages in thread
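[Editorial note: the '--base' workflow the robot suggests can be demonstrated in a throwaway repository; the file names and commit messages below are made up for illustration.]

```shell
# Demonstrate 'git format-patch --base': it embeds a base-commit trailer
# so 0-day and reviewers know which tree the series applies to.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"

echo base > file.c; git add file.c; git commit -qm "initial commit"
base=$(git rev-parse HEAD)

echo change > file.c; git commit -qam "x86,sched: example change"

# Record the base commit when generating the series:
git format-patch --base="$base" -o outgoing/ "$base"..HEAD

# The generated patch carries a trailer naming the base commit:
grep "^base-commit:" outgoing/0001-*.patch
```

In the real workflow `--base` would point at the commit the series was developed on (here, the tip of tip/sched/core).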
* Re: [PATCH v2 1/2] x86,sched: Add support for frequency invariance 2019-10-02 12:29 ` [PATCH v2 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich 2019-10-02 15:23 ` kbuild test robot @ 2019-10-02 16:43 ` kbuild test robot 2019-10-02 18:27 ` Peter Zijlstra 2019-10-03 10:27 ` Rafael J. Wysocki 3 siblings, 0 replies; 25+ messages in thread From: kbuild test robot @ 2019-10-02 16:43 UTC (permalink / raw) To: Giovanni Gherdovich Cc: kbuild-all, Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki, x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies, Giovanni Gherdovich [-- Attachment #1: Type: text/plain, Size: 1694 bytes --] Hi Giovanni, Thank you for the patch! Yet something to improve: [auto build test ERROR on tip/sched/core] [cannot apply to v5.4-rc1 next-20191002] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system. 
BTW, we also suggest to use '--base' option to specify the base tree in git format-patch, please see https://stackoverflow.com/a/37406982] url: https://github.com/0day-ci/linux/commits/Giovanni-Gherdovich/Add-support-for-frequency-invariance-for-some-x86/20191002-221807 config: x86_64-randconfig-a004-201939 (attached as .config) compiler: gcc-5 (Ubuntu 5.5.0-12ubuntu1) 5.5.0 20171010 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 If you fix the issue, kindly add following tag Reported-by: kbuild test robot <lkp@intel.com> All errors (new ones prefixed by >>): >> arch/x86/kernel/smpboot.c:1834:7: error: 'INTEL_FAM6_ATOM_GOLDMONT_D' undeclared here (not in a function) ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), ^ arch/x86/kernel/smpboot.c:1824:25: note: in definition of macro 'ICPU' { X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0} ^ vim +/INTEL_FAM6_ATOM_GOLDMONT_D +1834 arch/x86/kernel/smpboot.c 1831 1832 static const struct x86_cpu_id has_turbo_ratio_group_limits[] = { 1833 ICPU(INTEL_FAM6_ATOM_GOLDMONT), > 1834 ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), 1835 ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS), 1836 ICPU(INTEL_FAM6_SKYLAKE_X), 1837 {} 1838 }; 1839 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 34807 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 1/2] x86,sched: Add support for frequency invariance 2019-10-02 12:29 ` [PATCH v2 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich 2019-10-02 15:23 ` kbuild test robot 2019-10-02 16:43 ` kbuild test robot @ 2019-10-02 18:27 ` Peter Zijlstra 2019-10-03 10:27 ` Rafael J. Wysocki 3 siblings, 0 replies; 25+ messages in thread From: Peter Zijlstra @ 2019-10-02 18:27 UTC (permalink / raw) To: Giovanni Gherdovich Cc: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown, Rafael J . Wysocki, x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Wed, Oct 02, 2019 at 02:29:25PM +0200, Giovanni Gherdovich wrote: > +void x86_arch_scale_freq_tick_enable(void) > +{ > + tick_disable = false; > +} > + > +static void reset_scale_freq(void *arg) > +{ > + this_cpu_write(arch_cpu_freq, SCHED_CAPACITY_SCALE); > +} > + > +void x86_arch_scale_freq_tick_disable(void) > +{ > + on_each_cpu(reset_scale_freq, NULL, 1); > + tick_disable = true; I'm thinking this ought to be the other way around, otherwise we can get a tick losing the 1024 we just wrote in arch_cpu_freq. > +} You've lost the prev_{a,m}perf update, so the first tick after enable will see 'funny' values. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 1/2] x86,sched: Add support for frequency invariance 2019-10-02 12:29 ` [PATCH v2 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich ` (2 preceding siblings ...) 2019-10-02 18:27 ` Peter Zijlstra @ 2019-10-03 10:27 ` Rafael J. Wysocki 2019-10-03 12:15 ` Peter Zijlstra 3 siblings, 1 reply; 25+ messages in thread From: Rafael J. Wysocki @ 2019-10-03 10:27 UTC (permalink / raw) To: Giovanni Gherdovich Cc: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Wednesday, October 2, 2019 2:29:25 PM CEST Giovanni Gherdovich wrote: > Implement arch_scale_freq_capacity() for 'modern' x86. This function > is used by the scheduler to correctly account usage in the face of > DVFS. > [cut] > > Suggested-by: Peter Zijlstra <peterz@infradead.org> > Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> > Acked-by: Doug Smythies <dsmythies@telus.net> > --- > arch/x86/include/asm/topology.h | 33 +++++++ > arch/x86/kernel/smpboot.c | 195 +++++++++++++++++++++++++++++++++++++++- > kernel/sched/core.c | 1 + > kernel/sched/sched.h | 7 ++ > 4 files changed, 235 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h > index 4b14d2318251..260410306879 100644 > --- a/arch/x86/include/asm/topology.h > +++ b/arch/x86/include/asm/topology.h > @@ -193,4 +193,37 @@ static inline void sched_clear_itmt_support(void) > } > #endif /* CONFIG_SCHED_MC_PRIO */ > > +#ifdef CONFIG_SMP > +#include <asm/cpufeature.h> > + > +DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key); > + > +#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key) > + > +DECLARE_PER_CPU(unsigned long, arch_cpu_freq); > + > +static inline long arch_scale_freq_capacity(int cpu) > +{ > + if (arch_scale_freq_invariant()) > + 
return per_cpu(arch_cpu_freq, cpu); > + > + return 1024 /* SCHED_CAPACITY_SCALE */; > +} > +#define arch_scale_freq_capacity arch_scale_freq_capacity > + > +extern void arch_scale_freq_tick(void); > +#define arch_scale_freq_tick arch_scale_freq_tick > + > +extern void x86_arch_scale_freq_tick_enable(void); > +extern void x86_arch_scale_freq_tick_disable(void); > +#else > +static inline void x86_arch_scale_freq_tick_enable(void) > +{ > +} > + > +static inline void x86_arch_scale_freq_tick_disable(void) > +{ > +} > +#endif > + > #endif /* _ASM_X86_TOPOLOGY_H */ > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > index 69881b2d446c..7aa72a51e1e9 100644 > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c > @@ -147,6 +147,8 @@ static inline void smpboot_restore_warm_reset_vector(void) > *((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0; > } > > +static void set_cpu_max_freq(void); > + > /* > * Report back to the Boot Processor during boot time or to the caller processor > * during CPU online. > @@ -183,6 +185,8 @@ static void smp_callin(void) > */ > set_cpu_sibling_map(raw_smp_processor_id()); > > + set_cpu_max_freq(); > + > /* > * Get our bogomips. > * Update loops_per_jiffy in cpu_data. Previous call to > @@ -1337,7 +1341,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus) > set_sched_topology(x86_topology); > > set_cpu_sibling_map(0); > - > + set_cpu_max_freq(); > smp_sanity_check(); > > switch (apic_intr_mode) { > @@ -1764,3 +1768,192 @@ void native_play_dead(void) > } > > #endif > + > +/* > + * APERF/MPERF frequency ratio computation. > + * > + * The scheduler wants to do frequency invariant accounting and needs a <1 > + * ratio to account for the 'current' frequency, corresponding to > + * freq_curr / freq_max. 
> + * > + * Since the frequency freq_curr on x86 is controlled by micro-controller and > + * our P-state setting is little more than a request/hint, we need to observe > + * the effective frequency 'BusyMHz', i.e. the average frequency over a time > + * interval after discarding idle time. This is given by: > + * > + * BusyMHz = delta_APERF / delta_MPERF * freq_base > + * > + * where freq_base is the max non-turbo P-state. > + * > + * The freq_max term has to be set to a somewhat arbitrary value, because we > + * can't know which turbo states will be available at a given point in time: > + * it all depends on the thermal headroom of the entire package. We set it to > + * the turbo level with 4 cores active. > + * > + * Benchmarks show that's a good compromise between the 1C turbo ratio > + * (freq_curr/freq_max would rarely reach 1) and something close to freq_base, > + * which would ignore the entire turbo range (a conspicuous part, making > + * freq_curr/freq_max always maxed out). > + * > + * Setting freq_max to anything less than the 1C turbo ratio makes the ratio > + * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1. > + */ > + > +DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key); > + > +static DEFINE_PER_CPU(u64, arch_prev_aperf); > +static DEFINE_PER_CPU(u64, arch_prev_mperf); > +static u64 arch_max_freq = SCHED_CAPACITY_SCALE; > + > +static bool turbo_disabled(void) > +{ > + u64 misc_en; > + int err; > + > + err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en); > + if (err) > + return false; > + > + return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE); > +} This setting may be updated by the platform firmware (BIOS) in some cases (see kernel.org BZ 200759, for example), so in general checking it once at the init time is not enough. 
> + > +#include <asm/cpu_device_id.h> > +#include <asm/intel-family.h> > + > +#define ICPU(model) \ > + { X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0} > + > +static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = { > + ICPU(INTEL_FAM6_XEON_PHI_KNL), > + ICPU(INTEL_FAM6_XEON_PHI_KNM), > + {} > +}; > + > +static const struct x86_cpu_id has_turbo_ratio_group_limits[] = { > + ICPU(INTEL_FAM6_ATOM_GOLDMONT), > + ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), > + ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS), > + ICPU(INTEL_FAM6_SKYLAKE_X), > + {} > +}; > + > +static void core_set_cpu_max_freq(void) > +{ > + u64 ratio, turbo_ratio; > + int err; > + > + if (smp_processor_id() != 0) > + return; > + > + if (turbo_disabled() || > + x86_match_cpu(has_knl_turbo_ratio_limits) || > + x86_match_cpu(has_turbo_ratio_group_limits)) > + return; > + I would move the checks above directly to intel_set_cpu_max_freq(). > + err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio); > + if (err) > + return; > + > + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio); > + if (err) > + return; > + > + ratio = (ratio >> 8) & 0xFF; /* max P state ratio */ > + turbo_ratio = (turbo_ratio >> 24) & 0xFF; /* 4C turbo ratio */ > + > + arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio); > + > + static_branch_enable(&arch_scale_freq_key); > +} > + > +static void intel_set_cpu_max_freq(void) > +{ > + /* > + * TODO: add support for: > + * > + * - Xeon Phi (KNM, KNL) > + * - Xeon Gold/Platinum, Atom Goldmont/Goldmont Plus > + * - Atom Silvermont > + * > + * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE > + */ > + core_set_cpu_max_freq(); > +} > + > +static void init_scale_freq(void *arg) > +{ > + u64 aperf, mperf; > + > + rdmsrl(MSR_IA32_APERF, aperf); > + rdmsrl(MSR_IA32_MPERF, mperf); > + > + this_cpu_write(arch_prev_aperf, aperf); > + this_cpu_write(arch_prev_mperf, mperf); > +} > + > +static void set_cpu_max_freq(void) > +{ > + if (!boot_cpu_has(X86_FEATURE_APERFMPERF)) > + 
return; > + > + switch (boot_cpu_data.x86_vendor) { > + case X86_VENDOR_INTEL: > + intel_set_cpu_max_freq(); > + break; > + default: > + break; > + } Why is the switch () needed? It seems that if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) intel_set_cpu_max_freq(); would do the trick. > + > + init_scale_freq(NULL); > +} > + > +DEFINE_PER_CPU(unsigned long, arch_cpu_freq); > + > +static bool tick_disable; > + > +void arch_scale_freq_tick(void) > +{ > + u64 freq; > + u64 aperf, mperf; > + u64 acnt, mcnt; > + > + if (!arch_scale_freq_invariant() || tick_disable) > + return; > + This may be a silly question, but can using tick_disable be avoided? I guess it is there, because disabling the static branch from x86_arch_scale_freq_tick_disable() would be unsafe, but I'm not sure why that would be the case? > + rdmsrl(MSR_IA32_APERF, aperf); > + rdmsrl(MSR_IA32_MPERF, mperf); > + > + acnt = aperf - this_cpu_read(arch_prev_aperf); > + mcnt = mperf - this_cpu_read(arch_prev_mperf); > + if (!mcnt) > + return; > + > + this_cpu_write(arch_prev_aperf, aperf); > + this_cpu_write(arch_prev_mperf, mperf); > + > + acnt <<= 2*SCHED_CAPACITY_SHIFT; > + mcnt *= arch_max_freq; > + > + freq = div64_u64(acnt, mcnt); > + > + if (freq > SCHED_CAPACITY_SCALE) > + freq = SCHED_CAPACITY_SCALE; > + > + this_cpu_write(arch_cpu_freq, freq); > +} > + > +void x86_arch_scale_freq_tick_enable(void) > +{ > + tick_disable = false; > +} > + > +static void reset_scale_freq(void *arg) > +{ > + this_cpu_write(arch_cpu_freq, SCHED_CAPACITY_SCALE); > +} > + > +void x86_arch_scale_freq_tick_disable(void) > +{ > + on_each_cpu(reset_scale_freq, NULL, 1); > + tick_disable = true; > +} > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 7880f4f64d0e..2bdce4a140ae 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3593,6 +3593,7 @@ void scheduler_tick(void) > struct task_struct *curr = rq->curr; > struct rq_flags rf; > > + arch_scale_freq_tick(); > sched_clock_tick(); > > 
rq_lock(rq, &rf); > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 0db2c1b3361e..0fe4f2dcd175 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1953,6 +1953,13 @@ static inline int hrtick_enabled(struct rq *rq) > > #endif /* CONFIG_SCHED_HRTICK */ > > +#ifndef arch_scale_freq_tick > +static __always_inline > +void arch_scale_freq_tick(void) > +{ > +} > +#endif > + > #ifndef arch_scale_freq_capacity > static __always_inline > unsigned long arch_scale_freq_capacity(int cpu) > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 1/2] x86,sched: Add support for frequency invariance 2019-10-03 10:27 ` Rafael J. Wysocki @ 2019-10-03 12:15 ` Peter Zijlstra 2019-10-03 17:36 ` Srinivas Pandruvada 2019-10-03 17:53 ` Rafael J. Wysocki 0 siblings, 2 replies; 25+ messages in thread From: Peter Zijlstra @ 2019-10-03 12:15 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Giovanni Gherdovich, Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown, x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Thu, Oct 03, 2019 at 12:27:52PM +0200, Rafael J. Wysocki wrote: > On Wednesday, October 2, 2019 2:29:25 PM CEST Giovanni Gherdovich wrote: > > +static bool turbo_disabled(void) > > +{ > > + u64 misc_en; > > + int err; > > + > > + err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en); > > + if (err) > > + return false; > > + > > + return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE); > > +} > > This setting may be updated by the platform firmware (BIOS) in some cases > (see kernel.org BZ 200759, for example), so in general checking it once > at the init time is not enough. Is there anything sane we can do if the BIOS frobs stuff like that under our feet? Other than yell bloody murder, that is? 
> > + > > +#include <asm/cpu_device_id.h> > > +#include <asm/intel-family.h> > > + > > +#define ICPU(model) \ > > + { X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0} > > + > > +static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = { > > + ICPU(INTEL_FAM6_XEON_PHI_KNL), > > + ICPU(INTEL_FAM6_XEON_PHI_KNM), > > + {} > > +}; > > + > > +static const struct x86_cpu_id has_turbo_ratio_group_limits[] = { > > + ICPU(INTEL_FAM6_ATOM_GOLDMONT), > > + ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), > > + ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS), > > + ICPU(INTEL_FAM6_SKYLAKE_X), > > + {} > > +}; > > + > > +static void core_set_cpu_max_freq(void) > > +{ > > + u64 ratio, turbo_ratio; > > + int err; > > + > > + if (smp_processor_id() != 0) > > + return; > > + > > + if (turbo_disabled() || > > + x86_match_cpu(has_knl_turbo_ratio_limits) || > > + x86_match_cpu(has_turbo_ratio_group_limits)) > > + return; > > + > > I would move the checks above directly to intel_set_cpu_max_freq(). The reason it is here, is that.. > > + err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio); > > + if (err) > > + return; > > + > > + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio); > > + if (err) > > + return; > > + > > + ratio = (ratio >> 8) & 0xFF; /* max P state ratio */ > > + turbo_ratio = (turbo_ratio >> 24) & 0xFF; /* 4C turbo ratio */ > > + > > + arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio); > > + > > + static_branch_enable(&arch_scale_freq_key); > > +} > > + > > +static void intel_set_cpu_max_freq(void) > > +{ > > + /* > > + * TODO: add support for: > > + * > > + * - Xeon Phi (KNM, KNL) > > + * - Xeon Gold/Platinum, Atom Goldmont/Goldmont Plus > > + * - Atom Silvermont > > + * > > + * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE > > + */ > > + core_set_cpu_max_freq(); This used to read something like: if (core_set_cpu_max_freq()) return; if (atom_set_cpu_max_freq()) return; ... 
and then those checks make sense, because we're failing the 'core' way, but another way might work. But in this version the atom version has gone missing -- I've suggested it be put back as an additional patch. Also, the SKX way still needs to be written.. > > +} > > + > > +static void init_scale_freq(void *arg) > > +{ > > + u64 aperf, mperf; > > + > > + rdmsrl(MSR_IA32_APERF, aperf); > > + rdmsrl(MSR_IA32_MPERF, mperf); > > + > > + this_cpu_write(arch_prev_aperf, aperf); > > + this_cpu_write(arch_prev_mperf, mperf); > > +} > > + > > +static void set_cpu_max_freq(void) > > +{ > > + if (!boot_cpu_has(X86_FEATURE_APERFMPERF)) > > + return; > > + > > + switch (boot_cpu_data.x86_vendor) { > > + case X86_VENDOR_INTEL: > > + intel_set_cpu_max_freq(); > > + break; > > + default: > > + break; > > + } > > Why is the switch () needed? > > It seems that > > if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) > intel_set_cpu_max_freq(); > > would do the trick. I was hoping to grow X86_VENDOR_AMD bits.. > > + > > + init_scale_freq(NULL); > > +} > > + > > +DEFINE_PER_CPU(unsigned long, arch_cpu_freq); > > + > > +static bool tick_disable; > > + > > +void arch_scale_freq_tick(void) > > +{ > > + u64 freq; > > + u64 aperf, mperf; > > + u64 acnt, mcnt; > > + > > + if (!arch_scale_freq_invariant() || tick_disable) > > + return; > > + > > This may be a silly question, but can using tick_disable be avoided? > > I guess it is there, because disabling the static branch from > x86_arch_scale_freq_tick_disable() would be unsafe, but I'm not > sure why that would be the case? There's not enough state -- we can of course fix that. That is, if you disable it, we don't know if we should enable it again later or if it was disabled because we failed to initialize it earlier. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 1/2] x86,sched: Add support for frequency invariance 2019-10-03 12:15 ` Peter Zijlstra @ 2019-10-03 17:36 ` Srinivas Pandruvada 2019-10-03 17:53 ` Rafael J. Wysocki 0 siblings, 0 replies; 25+ messages in thread From: Srinivas Pandruvada @ 2019-10-03 17:36 UTC (permalink / raw) To: Peter Zijlstra, Rafael J. Wysocki Cc: Giovanni Gherdovich, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown, x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Thu, 2019-10-03 at 14:15 +0200, Peter Zijlstra wrote: > On Thu, Oct 03, 2019 at 12:27:52PM +0200, Rafael J. Wysocki wrote: > > On Wednesday, October 2, 2019 2:29:25 PM CEST Giovanni Gherdovich > > wrote: > > > +static bool turbo_disabled(void) > > > +{ > > > + u64 misc_en; > > > + int err; > > > + > > > + err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en); > > > + if (err) > > > + return false; > > > + > > > + return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE); > > > +} > > > > This setting may be updated by the platform firmware (BIOS) in some > > cases > > (see kernel.org BZ 200759, for example), so in general checking it > > once > > at the init time is not enough. > > Is there anything sane we can do if the BIOS frobs stuff like that > under > our feet? Other than yell bloody murder, that is? On the platform this BZ is talking about, you will get an ACPI notification about this change. However, the notification will happen for only one CPU, while it needs to be accounted for on all CPUs.
Thanks, Srinivas > > > > + > > > +#include <asm/cpu_device_id.h> > > > +#include <asm/intel-family.h> > > > + > > > +#define ICPU(model) \ > > > + { X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0} > > > + > > > +static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = { > > > + ICPU(INTEL_FAM6_XEON_PHI_KNL), > > > + ICPU(INTEL_FAM6_XEON_PHI_KNM), > > > + {} > > > +}; > > > + > > > +static const struct x86_cpu_id has_turbo_ratio_group_limits[] = > > > { > > > + ICPU(INTEL_FAM6_ATOM_GOLDMONT), > > > + ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), > > > + ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS), > > > + ICPU(INTEL_FAM6_SKYLAKE_X), > > > + {} > > > +}; > > > + > > > +static void core_set_cpu_max_freq(void) > > > +{ > > > + u64 ratio, turbo_ratio; > > > + int err; > > > + > > > + if (smp_processor_id() != 0) > > > + return; > > > + > > > + if (turbo_disabled() || > > > + x86_match_cpu(has_knl_turbo_ratio_limits) || > > > + x86_match_cpu(has_turbo_ratio_group_limits)) > > > + return; > > > + > > > > I would move the checks above directly to intel_set_cpu_max_freq(). > > The reason it is here, is that.. 
> > > > + err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio); > > > + if (err) > > > + return; > > > + > > > + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio); > > > + if (err) > > > + return; > > > + > > > + ratio = (ratio >> 8) & 0xFF; /* max P state > > > ratio */ > > > + turbo_ratio = (turbo_ratio >> 24) & 0xFF; /* 4C turbo ratio > > > */ > > > + > > > + arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, > > > ratio); > > > + > > > + static_branch_enable(&arch_scale_freq_key); > > > +} > > > + > > > +static void intel_set_cpu_max_freq(void) > > > +{ > > > + /* > > > + * TODO: add support for: > > > + * > > > + * - Xeon Phi (KNM, KNL) > > > + * - Xeon Gold/Platinum, Atom Goldmont/Goldmont Plus > > > + * - Atom Silvermont > > > + * > > > + * which all now get by default arch_max_freq = > > > SCHED_CAPACITY_SCALE > > > + */ > > > + core_set_cpu_max_freq(); > > This used to read something like: > > if (core_set_cpu_max_freq()) > return; > > if (atom_set_cpu_max_freq()) > return; > > ... > > and then those checks make sense, because we're failing the 'core' > way, > but another way might work. > > But in this version the atom version has gone missing -- I've > suggested > it be put back as an additional patch. > > Also, the SKX way still needs to be written.. > > > > +} > > > + > > > +static void init_scale_freq(void *arg) > > > +{ > > > + u64 aperf, mperf; > > > + > > > + rdmsrl(MSR_IA32_APERF, aperf); > > > + rdmsrl(MSR_IA32_MPERF, mperf); > > > + > > > + this_cpu_write(arch_prev_aperf, aperf); > > > + this_cpu_write(arch_prev_mperf, mperf); > > > +} > > > + > > > +static void set_cpu_max_freq(void) > > > +{ > > > + if (!boot_cpu_has(X86_FEATURE_APERFMPERF)) > > > + return; > > > + > > > + switch (boot_cpu_data.x86_vendor) { > > > + case X86_VENDOR_INTEL: > > > + intel_set_cpu_max_freq(); > > > + break; > > > + default: > > > + break; > > > + } > > > > Why is the switch () needed? 
> > > > It seems that > > > > if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) > > intel_set_cpu_max_freq(); > > > > would do the trick. > > I was hoping to grow X86_VENDOR_AMD bits.. > > > > + > > > + init_scale_freq(NULL); > > > +} > > > + > > > +DEFINE_PER_CPU(unsigned long, arch_cpu_freq); > > > + > > > +static bool tick_disable; > > > + > > > +void arch_scale_freq_tick(void) > > > +{ > > > + u64 freq; > > > + u64 aperf, mperf; > > > + u64 acnt, mcnt; > > > + > > > + if (!arch_scale_freq_invariant() || tick_disable) > > > + return; > > > + > > > > This may be a silly question, but can using tick_disable be > > avoided? > > > > I guess it is there, because disabling the static branch from > > x86_arch_scale_freq_tick_disable() would be unsafe, but I'm not > > sure why that would be the case? > > There's not enough state -- we can of course fix that. > > That is, if you disable it, we don't know if we should enable it > again > later or if it was disabled because we failed to initialize it > earlier. ^ permalink raw reply [flat|nested] 25+ messages in thread
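The hunk quoted above ends before the arithmetic that turns the APERF/MPERF deltas into a scale factor. As a rough user-space sketch (the helper name, the clamping, and the overflow caveat are assumptions on my part, not code lifted from the patch): acnt/mcnt is the ratio of current to base frequency, and arch_max_freq is max/base frequency scaled by 1024, so their quotient gives cur_freq/max_freq in 1024ths.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024ULL

/*
 * Hypothetical sketch of the per-tick computation: acnt and mcnt are
 * the APERF and MPERF deltas since the last tick.  Real kernel code
 * would also have to guard the multiplications against 64-bit
 * overflow (e.g. by limiting how stale the previous samples can be).
 */
static unsigned long freq_scale(uint64_t acnt, uint64_t mcnt,
				unsigned long arch_max_freq)
{
	uint64_t scale;

	if (!mcnt || !arch_max_freq)	/* avoid division by zero */
		return SCHED_CAPACITY_SCALE;

	/* (acnt/mcnt) / (arch_max_freq/1024), expressed in 1024ths */
	scale = (acnt * SCHED_CAPACITY_SCALE * SCHED_CAPACITY_SCALE) /
		(mcnt * arch_max_freq);

	return scale > SCHED_CAPACITY_SCALE ?
		SCHED_CAPACITY_SCALE : (unsigned long)scale;
}
```

With arch_max_freq = 1280 (4C turbo 25% above base), a CPU ticking along at base frequency reports roughly 819/1024 of capacity, and only at the turbo frequency does it reach the full 1024.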
* Re: [PATCH v2 1/2] x86,sched: Add support for frequency invariance 2019-10-03 12:15 ` Peter Zijlstra 2019-10-03 17:36 ` Srinivas Pandruvada @ 2019-10-03 17:53 ` Rafael J. Wysocki 2019-10-04 11:48 ` Peter Zijlstra 2019-10-08 7:48 ` Giovanni Gherdovich 1 sibling, 2 replies; 25+ messages in thread From: Rafael J. Wysocki @ 2019-10-03 17:53 UTC (permalink / raw) To: Peter Zijlstra Cc: Giovanni Gherdovich, Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown, x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Thursday, October 3, 2019 2:15:37 PM CEST Peter Zijlstra wrote: > On Thu, Oct 03, 2019 at 12:27:52PM +0200, Rafael J. Wysocki wrote: > > On Wednesday, October 2, 2019 2:29:25 PM CEST Giovanni Gherdovich wrote: > > > +static bool turbo_disabled(void) > > > +{ > > > + u64 misc_en; > > > + int err; > > > + > > > + err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en); > > > + if (err) > > > + return false; > > > + > > > + return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE); > > > +} > > > > This setting may be updated by the platform firmware (BIOS) in some cases > > (see kernel.org BZ 200759, for example), so in general checking it once > > at the init time is not enough. > > Is there anything sane we can do if the BIOS frobs stuff like that under > our feet? Other than yell bloody murder, that is? Sane? No, I don't think so. Now, in principle *something* could be done to fix things up in the _PPC notify handler, but I guess we would just end up disabling the scale invariance code altogether in those cases. 
> > > + > > > +#include <asm/cpu_device_id.h> > > > +#include <asm/intel-family.h> > > > + > > > +#define ICPU(model) \ > > > + { X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0} > > > + > > > +static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = { > > > + ICPU(INTEL_FAM6_XEON_PHI_KNL), > > > + ICPU(INTEL_FAM6_XEON_PHI_KNM), > > > + {} > > > +}; > > > + > > > +static const struct x86_cpu_id has_turbo_ratio_group_limits[] = { > > > + ICPU(INTEL_FAM6_ATOM_GOLDMONT), > > > + ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), > > > + ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS), > > > + ICPU(INTEL_FAM6_SKYLAKE_X), > > > + {} > > > +}; > > > + > > > +static void core_set_cpu_max_freq(void) > > > +{ > > > + u64 ratio, turbo_ratio; > > > + int err; > > > + > > > + if (smp_processor_id() != 0) > > > + return; > > > + > > > + if (turbo_disabled() || > > > + x86_match_cpu(has_knl_turbo_ratio_limits) || > > > + x86_match_cpu(has_turbo_ratio_group_limits)) > > > + return; > > > + > > > > I would move the checks above directly to intel_set_cpu_max_freq(). > > The reason it is here, is that.. 
> > > > + err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio); > > > + if (err) > > > + return; > > > + > > > + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio); > > > + if (err) > > > + return; > > > + > > > + ratio = (ratio >> 8) & 0xFF; /* max P state ratio */ > > > + turbo_ratio = (turbo_ratio >> 24) & 0xFF; /* 4C turbo ratio */ > > > + > > > + arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio); > > > + > > > + static_branch_enable(&arch_scale_freq_key); > > > +} > > > + > > > +static void intel_set_cpu_max_freq(void) > > > +{ > > > + /* > > > + * TODO: add support for: > > > + * > > > + * - Xeon Phi (KNM, KNL) > > > + * - Xeon Gold/Platinum, Atom Goldmont/Goldmont Plus > > > + * - Atom Silvermont > > > + * > > > + * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE > > > + */ > > > + core_set_cpu_max_freq(); > > This used to read something like: > > if (core_set_cpu_max_freq()) > return; > > if (atom_set_cpu_max_freq()) > return; > > ... > > and then those checks make sense, because we're failing the 'core' way, > but another way might work. > > But in this version the atom version has gone missing -- I've suggested > it be put back as an additional patch. > > Also, the SKX way still needs to be written.. Well, but the smp_processor_id() check has nothing to do with whether or not this is "core" or "atom" or something else, for example. And to me if (this is not core) return; do_core_stuff(); is slightly more straightforward than putting the "this is not core" check into do_core_stuff() as the former avoids a redundant call (at least in principle). 
> > > +} > > > + > > > +static void init_scale_freq(void *arg) > > > +{ > > > + u64 aperf, mperf; > > > + > > > + rdmsrl(MSR_IA32_APERF, aperf); > > > + rdmsrl(MSR_IA32_MPERF, mperf); > > > + > > > + this_cpu_write(arch_prev_aperf, aperf); > > > + this_cpu_write(arch_prev_mperf, mperf); > > > +} > > > + > > > +static void set_cpu_max_freq(void) > > > +{ > > > + if (!boot_cpu_has(X86_FEATURE_APERFMPERF)) > > > + return; > > > + > > > + switch (boot_cpu_data.x86_vendor) { > > > + case X86_VENDOR_INTEL: > > > + intel_set_cpu_max_freq(); > > > + break; > > > + default: > > > + break; > > > + } > > > > Why is the switch () needed? > > > > It seems that > > > > if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) > > intel_set_cpu_max_freq(); > > > > would do the trick. > > I was hoping to grow X86_VENDOR_AMD bits.. Well, the if () can be changed into switch () easily enough while adding them. > > > + > > > + init_scale_freq(NULL); > > > +} > > > + > > > +DEFINE_PER_CPU(unsigned long, arch_cpu_freq); > > > + > > > +static bool tick_disable; > > > + > > > +void arch_scale_freq_tick(void) > > > +{ > > > + u64 freq; > > > + u64 aperf, mperf; > > > + u64 acnt, mcnt; > > > + > > > + if (!arch_scale_freq_invariant() || tick_disable) > > > + return; > > > + > > > > This may be a silly question, but can using tick_disable be avoided? > > > > I guess it is there, because disabling the static branch from > > x86_arch_scale_freq_tick_disable() would be unsafe, but I'm not > > sure why that would be the case? > > There's not enough state -- we can of course fix that. > > That is, if you disable it, we don't know if we should enable it again > later or if it was disabled because we failed to initialize it earlier. Two bits can be used for storing that info, so BIT(1) is set when we fail to initialize it and BIT(0) is set when it is disabled later. Or similar. ^ permalink raw reply [flat|nested] 25+ messages in thread
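Rafael's two-bit scheme might look like the following user-space sketch (names invented; the real version would sit next to the static key and would need to be made race-free): BIT(1) records a failed initialization, BIT(0) a run-time disable, and only the latter may be undone.

```c
#include <assert.h>
#include <stdbool.h>

#define FREQ_INV_INIT_FAILED	(1U << 1)	/* BIT(1): never initialized */
#define FREQ_INV_DISABLED	(1U << 0)	/* BIT(0): disabled at run time */

static unsigned int freq_inv_state;

static void freq_inv_init(bool success)
{
	freq_inv_state = success ? 0 : FREQ_INV_INIT_FAILED;
}

static void freq_inv_disable(void)
{
	freq_inv_state |= FREQ_INV_DISABLED;
}

/* Re-enabling succeeds only if initialization worked in the first place. */
static bool freq_inv_enable(void)
{
	if (freq_inv_state & FREQ_INV_INIT_FAILED)
		return false;
	freq_inv_state &= ~FREQ_INV_DISABLED;
	return true;
}

static bool freq_inv_active(void)
{
	return freq_inv_state == 0;
}
```

The point of the second bit is exactly the state Peter says is missing: a disable followed by an enable round-trips, while a failed init makes the enable a no-op.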
* Re: [PATCH v2 1/2] x86,sched: Add support for frequency invariance 2019-10-03 17:53 ` Rafael J. Wysocki @ 2019-10-04 11:48 ` Peter Zijlstra 2019-10-08 7:48 ` Giovanni Gherdovich 1 sibling, 0 replies; 25+ messages in thread From: Peter Zijlstra @ 2019-10-04 11:48 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Giovanni Gherdovich, Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown, x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Thu, Oct 03, 2019 at 07:53:49PM +0200, Rafael J. Wysocki wrote: > On Thursday, October 3, 2019 2:15:37 PM CEST Peter Zijlstra wrote: > > On Thu, Oct 03, 2019 at 12:27:52PM +0200, Rafael J. Wysocki wrote: > > > > + if (smp_processor_id() != 0) > > > > + return; > Well, but the smp_processor_id() check has nothing to do with whether or not > this is "core" or "atom" or something else, for example. It is dodgy to begin with; it hard-assumes we boot on cpu-0. An 'initialized' state would probably be better. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 1/2] x86,sched: Add support for frequency invariance 2019-10-03 17:53 ` Rafael J. Wysocki 2019-10-04 11:48 ` Peter Zijlstra @ 2019-10-08 7:48 ` Giovanni Gherdovich 2019-10-08 9:32 ` Rafael J. Wysocki 1 sibling, 1 reply; 25+ messages in thread From: Giovanni Gherdovich @ 2019-10-08 7:48 UTC (permalink / raw) To: Rafael J. Wysocki, Peter Zijlstra Cc: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown, x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Thu, 2019-10-03 at 19:53 +0200, Rafael J. Wysocki wrote: > On Thursday, October 3, 2019 2:15:37 PM CEST Peter Zijlstra wrote: > > On Thu, Oct 03, 2019 at 12:27:52PM +0200, Rafael J. Wysocki wrote: > > > On Wednesday, October 2, 2019 2:29:25 PM CEST Giovanni Gherdovich wrote: > > > > +static bool turbo_disabled(void) > > > > +{ > > > > + u64 misc_en; > > > > + int err; > > > > + > > > > + err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en); > > > > + if (err) > > > > + return false; > > > > + > > > > + return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE); > > > > +} > > > > > > This setting may be updated by the platform firmware (BIOS) in some cases > > > (see kernel.org BZ 200759, for example), so in general checking it once > > > at the init time is not enough. > > > > Is there anything sane we can do if the BIOS frobs stuff like that under > > our feet? Other than yell bloody murder, that is? > > Sane? No, I don't think so. > > Now, in principle *something* could be done to fix things up in the _PPC > notify handler, but I guess we would just end up disabling the scale > invariance code altogether in those cases. I'm looking at how to react to turbo being disabled at run time, assuming a _PPC notification is triggered in that case. 
I don't think the correct action would be to disable scale invariance: if the turbo range is not available, then max frequency is max_P, and scale invariance can go on using that. The case max_freq=max_P is represented by arch_max_freq=1024 in this patch (because arch_max_freq=max_freq*1024/max_P). Since the variable arch_max_freq is global to all CPUs, the fact that the _PPC notification is sent to just one CPU is not a concern: the CPU receiving the notif will set arch_max_freq=1024 (Srinivas was worried about this in another message). This looks like a job for the ->update_limits callback you added to "struct cpufreq_driver" in response to the mentioned kernel.org BZ 200759. I see that only intel_pstate implements it, it's not clear to me yet if I'll have to give an ->update_limits to acpi_cpufreq as well to treat this case. Giovanni ^ permalink raw reply [flat|nested] 25+ messages in thread
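The relation Giovanni uses, arch_max_freq = max_freq * 1024 / max_P, can be checked numerically against the MSR fields patch 1 extracts. A sketch with made-up MSR values (illustrative only, not from real hardware): with a 100 MHz bus clock, a base ratio of 24 means 2.4 GHz and a 4-core turbo ratio of 30 means 3.0 GHz.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024ULL

/* Bits 15:8 of MSR_PLATFORM_INFO: the maximum non-turbo ratio. */
static uint64_t base_ratio(uint64_t platform_info)
{
	return (platform_info >> 8) & 0xFF;
}

/* Bits 31:24 of MSR_TURBO_RATIO_LIMIT: the 4-core turbo ratio. */
static uint64_t turbo_4c_ratio(uint64_t turbo_ratio_limit)
{
	return (turbo_ratio_limit >> 24) & 0xFF;
}

/* arch_max_freq as in the patch: 4C-turbo/base ratio, scaled by 1024. */
static uint64_t compute_arch_max_freq(uint64_t platform_info,
				      uint64_t turbo_ratio_limit)
{
	return turbo_4c_ratio(turbo_ratio_limit) * SCHED_CAPACITY_SCALE /
	       base_ratio(platform_info);
}
```

This gives arch_max_freq = 30 * 1024 / 24 = 1280 while turbo is available; if the turbo range goes away, max_freq equals max_P and the same formula collapses to 1024, which is the value a _PPC handler would write.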
* Re: [PATCH v2 1/2] x86,sched: Add support for frequency invariance 2019-10-08 7:48 ` Giovanni Gherdovich @ 2019-10-08 9:32 ` Rafael J. Wysocki 0 siblings, 0 replies; 25+ messages in thread From: Rafael J. Wysocki @ 2019-10-08 9:32 UTC (permalink / raw) To: Giovanni Gherdovich Cc: Rafael J. Wysocki, Peter Zijlstra, Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown, the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Tue, Oct 8, 2019 at 9:43 AM Giovanni Gherdovich <ggherdovich@suse.cz> wrote: > > On Thu, 2019-10-03 at 19:53 +0200, Rafael J. Wysocki wrote: > > On Thursday, October 3, 2019 2:15:37 PM CEST Peter Zijlstra wrote: > > > On Thu, Oct 03, 2019 at 12:27:52PM +0200, Rafael J. Wysocki wrote: > > > > On Wednesday, October 2, 2019 2:29:25 PM CEST Giovanni Gherdovich wrote: > > > > > +static bool turbo_disabled(void) > > > > > +{ > > > > > + u64 misc_en; > > > > > + int err; > > > > > + > > > > > + err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en); > > > > > + if (err) > > > > > + return false; > > > > > + > > > > > + return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE); > > > > > +} > > > > > > > > This setting may be updated by the platform firmware (BIOS) in some cases > > > > (see kernel.org BZ 200759, for example), so in general checking it once > > > > at the init time is not enough. > > > > > > Is there anything sane we can do if the BIOS frobs stuff like that under > > > our feet? Other than yell bloody murder, that is? > > > > Sane? No, I don't think so. > > > > Now, in principle *something* could be done to fix things up in the _PPC > > notify handler, but I guess we would just end up disabling the scale > > invariance code altogether in those cases. > > I'm looking at how to react to turbo being disabled at run time, assuming a > _PPC notification is triggered in that case. 
> > I don't think the correct action would be to disable scale invariance: if the > turbo range is not available, then max frequency is max_P, and scale > invariance can go on using that. The case max_freq=max_P is represented by > arch_max_freq=1024 in this patch (because arch_max_freq=max_freq*1024/max_P). OK, so now you have the case when the BIOS enables turbo on the fly, but then the scale invariance is not going to be enabled AFAICS. > Since the variable arch_max_freq is global to all CPUs, the fact that the _PPC > notification is sent to just one CPU is not a concern: the CPU receiving the > notif will set arch_max_freq=1024 (Srinivas was worried about this in another > message). > > This looks like a job for the ->update_limits callback you added to "struct > cpufreq_driver" in response to the mentioned kernel.org BZ 200759. > I see that only intel_pstate implements it, it's not clear to me yet if I'll > have to give an ->update_limits to acpi_cpufreq as well to treat this case. If you want acpi_cpufreq and intel_pstate to be consistent with each other in that respect, then yes. ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting 2019-10-02 12:29 [PATCH v2 0/2] Add support for frequency invariance for (some) x86 Giovanni Gherdovich 2019-10-02 12:29 ` [PATCH v2 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich @ 2019-10-02 12:29 ` Giovanni Gherdovich 2019-10-03 18:05 ` Rafael J. Wysocki 1 sibling, 1 reply; 25+ messages in thread From: Giovanni Gherdovich @ 2019-10-02 12:29 UTC (permalink / raw) To: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, Rafael J . Wysocki Cc: x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies, Giovanni Gherdovich From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> intel_pstate has two operating modes: active and passive. In "active" mode, the in-built scaling governor is used and in "passive" mode, the driver can be used with any governor like "schedutil". In "active" mode the utilization values from schedutil are not used, and there is a requirement from high-performance computing use cases not to read any APERF/MPERF MSRs. In that case there is no need to spend CPU cycles on frequency invariant accounting by reading APERF/MPERF MSRs. With this change, frequency invariant accounting is only enabled in "passive" mode. 
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> --- drivers/cpufreq/intel_pstate.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index 9f02de9a1b47..c7d9149e99ee 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -2493,6 +2493,8 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver) { int ret; + x86_arch_scale_freq_tick_disable(); + memset(&global, 0, sizeof(global)); global.max_perf_pct = 100; @@ -2505,6 +2507,9 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver) global.min_perf_pct = min_perf_pct_min(); + if (driver == &intel_cpufreq) + x86_arch_scale_freq_tick_enable(); + return 0; } -- 2.16.4 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting 2019-10-02 12:29 ` [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting Giovanni Gherdovich @ 2019-10-03 18:05 ` Rafael J. Wysocki 2019-10-04 3:31 ` Srinivas Pandruvada 0 siblings, 1 reply; 25+ messages in thread From: Rafael J. Wysocki @ 2019-10-03 18:05 UTC (permalink / raw) To: Giovanni Gherdovich Cc: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich wrote: > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> > > intel_pstate has two operating modes: active and passive. In "active" > mode, the in-built scaling governor is used and in "passive" mode, > the driver can be used with any governor like "schedutil". In "active" > mode the utilization values from schedutil is not used and there is > a requirement from high performance computing use cases, not to read > any APERF/MPERF MSRs. Well, this isn't quite convincing. In particular, I don't see why the "don't read APERF/MPERF MSRs" argument applies *only* to intel_pstate in the "active" mode. What about intel_pstate in the "passive" mode combined with the "performance" governor? Or any other governor different from "schedutil" for that matter? And what about acpi_cpufreq combined with any governor different from "schedutil"? Scale invariance is not really needed in all of those cases right now AFAICS, or is it? So is the real concern that intel_pstate in the "active" mode reads the MPERF and APERF MSRs by itself and that kind of duplicates what the scale invariance code does and is redundant etc? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting 2019-10-03 18:05 ` Rafael J. Wysocki @ 2019-10-04 3:31 ` Srinivas Pandruvada 2019-10-04 8:08 ` Rafael J. Wysocki 2019-10-04 8:29 ` Giovanni Gherdovich 0 siblings, 2 replies; 25+ messages in thread From: Srinivas Pandruvada @ 2019-10-04 3:31 UTC (permalink / raw) To: Rafael J. Wysocki, Giovanni Gherdovich Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote: > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich > wrote: > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> > > > > intel_pstate has two operating modes: active and passive. In > > "active" > > mode, the in-built scaling governor is used and in "passive" mode, > > the driver can be used with any governor like "schedutil". In > > "active" > > mode the utilization values from schedutil is not used and there is > > a requirement from high performance computing use cases, not to > > readas well > > any APERF/MPERF MSRs. > > Well, this isn't quite convincing. > > In particular, I don't see why the "don't read APERF/MPERF MSRs" > argument > applies *only* to intel_pstate in the "active" mode. What about > intel_pstate > in the "passive" mode combined with the "performance" governor? Or > any other > governor different from "schedutil" for that matter? > > And what about acpi_cpufreq combined with any governor different from > "schedutil"? > > Scale invariance is not really needed in all of those cases right now > AFAICS, > or is it? Correct. This is just part of the patch to disable in active mode (particularly in HWP and performance mode). But this patch is 2 years old. 
The folks who wanted this, disable intel-pstate and use userspace governor with acpi-cpufreq. So may be better to address those cases too. > > So is the real concern that intel_pstate in the "active" mode reads > the MPERF > and APERF MSRs by itself and that kind of duplicates what the scale > invariance > code does and is redundant etc? It is redundant in non-HWP mode. In HWP and performance (active mode) we don't use atleast at this time. Thanks Srinivas ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting 2019-10-04 3:31 ` Srinivas Pandruvada @ 2019-10-04 8:08 ` Rafael J. Wysocki 2019-10-04 8:29 ` Giovanni Gherdovich 1 sibling, 0 replies; 25+ messages in thread From: Rafael J. Wysocki @ 2019-10-04 8:08 UTC (permalink / raw) To: Srinivas Pandruvada Cc: Rafael J. Wysocki, Giovanni Gherdovich, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Fri, Oct 4, 2019 at 5:31 AM Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> wrote: > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote: > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich > > wrote: > > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> > > > > > > intel_pstate has two operating modes: active and passive. In > > > "active" > > > mode, the in-built scaling governor is used and in "passive" mode, > > > the driver can be used with any governor like "schedutil". In > > > "active" > > > mode the utilization values from schedutil is not used and there is > > > a requirement from high performance computing use cases, not to > > > readas well > > > any APERF/MPERF MSRs. > > > > Well, this isn't quite convincing. > > > > In particular, I don't see why the "don't read APERF/MPERF MSRs" > > argument > > applies *only* to intel_pstate in the "active" mode. What about > > intel_pstate > > in the "passive" mode combined with the "performance" governor? Or > > any other > > governor different from "schedutil" for that matter? > > > > And what about acpi_cpufreq combined with any governor different from > > "schedutil"? > > > > Scale invariance is not really needed in all of those cases right now > > AFAICS, > > or is it? > > Correct. 
This is just part of the patch to disable in active mode > (particularly in HWP and performance mode). > > But this patch is 2 years old. The folks who wanted this, disable > intel-pstate and use userspace governor with acpi-cpufreq. So may be > better to address those cases too. Well, that's my point. :-) It looks like the scale invariance is only needed when the schedutil governor is used, regardless of the driver, and it may lead to performance degradation in the other cases, at least in principle (I wonder, though, if any hard data supporting that claim are available). That can be addressed in two ways IMO, either by reducing the possible negative impact of the scale invariance code (eg. by running it less frequently), so that it can be always enabled (as long as it is supported by the processor), or by avoiding to run it in all cases when it is not needed (but that basically would require the governor ->init and ->exit to enable and disable the scale invariance, respectively). > > > > So is the real concern that intel_pstate in the "active" mode reads > > the MPERF > > and APERF MSRs by itself and that kind of duplicates what the scale > > invariance > > code does and is redundant etc? > It is redundant in non-HWP mode. In HWP and performance (active mode) > we don't use atleast at this time. Right. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting 2019-10-04 3:31 ` Srinivas Pandruvada 2019-10-04 8:08 ` Rafael J. Wysocki @ 2019-10-04 8:29 ` Giovanni Gherdovich 2019-10-04 8:28 ` Vincent Guittot 2019-10-04 8:29 ` Rafael J. Wysocki 1 sibling, 2 replies; 25+ messages in thread From: Giovanni Gherdovich @ 2019-10-04 8:29 UTC (permalink / raw) To: Srinivas Pandruvada, Rafael J. Wysocki Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, x86, linux-pm, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote: > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote: > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich > > wrote: > > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> > > > > > > intel_pstate has two operating modes: active and passive. In "active" > > > mode, the in-built scaling governor is used and in "passive" mode, the > > > driver can be used with any governor like "schedutil". In "active" mode > > > the utilization values from schedutil is not used and there is a > > > requirement from high performance computing use cases, not to readas > > > well any APERF/MPERF MSRs. > > > > Well, this isn't quite convincing. > > > > In particular, I don't see why the "don't read APERF/MPERF MSRs" argument > > applies *only* to intel_pstate in the "active" mode. What about > > intel_pstate in the "passive" mode combined with the "performance" > > governor? Or any other governor different from "schedutil" for that > > matter? > > > > And what about acpi_cpufreq combined with any governor different from > > "schedutil"? > > > > Scale invariance is not really needed in all of those cases right now > > AFAICS, or is it? > > Correct. 
This is just part of the patch to disable in active mode > (particularly in HWP and performance mode). > > But this patch is 2 years old. The folks who wanted this, disable > intel-pstate and use userspace governor with acpi-cpufreq. So may be > better to address those cases too. I disagree with "scale invariance is needed only by the schedutil governor"; the two other users are the CPU's estimated utilization in the wakeup path, via cpu_util_without(), as well as the load-balance path, via cpu_util() which is used by update_sg_lb_stats(). Also remember that scale invariance is applied to both PELT signals util_avg and load_avg; schedutil uses the former but not the latter. I understand Srinivas patch to disable MSR accesses during the tick as a band-aid solution to address a specific use case he cares about, but I don't think that extending this approach to any non-schedutil governor is a good idea -- you'd be killing load balancing in the process. Giovanni ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting 2019-10-04 8:29 ` Giovanni Gherdovich @ 2019-10-04 8:28 ` Vincent Guittot 2019-10-04 8:33 ` Rafael J. Wysocki 2019-10-04 8:29 ` Rafael J. Wysocki 1 sibling, 1 reply; 25+ messages in thread From: Vincent Guittot @ 2019-10-04 8:28 UTC (permalink / raw) To: Giovanni Gherdovich Cc: Srinivas Pandruvada, Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, x86, open list:THERMAL, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Quentin Perret, Dietmar Eggemann, Doug Smythies On Fri, 4 Oct 2019 at 10:24, Giovanni Gherdovich <ggherdovich@suse.cz> wrote: > > On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote: > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote: > > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich > > > wrote: > > > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> > > > > > > > > intel_pstate has two operating modes: active and passive. In "active" > > > > mode, the in-built scaling governor is used and in "passive" mode, the > > > > driver can be used with any governor like "schedutil". In "active" mode > > > > the utilization values from schedutil is not used and there is a > > > > requirement from high performance computing use cases, not to readas > > > > well any APERF/MPERF MSRs. > > > > > > Well, this isn't quite convincing. > > > > > > In particular, I don't see why the "don't read APERF/MPERF MSRs" argument > > > applies *only* to intel_pstate in the "active" mode. What about > > > intel_pstate in the "passive" mode combined with the "performance" > > > governor? Or any other governor different from "schedutil" for that > > > matter? > > > > > > And what about acpi_cpufreq combined with any governor different from > > > "schedutil"? > > > > > > Scale invariance is not really needed in all of those cases right now > > > AFAICS, or is it? 
> > > > Correct. This is just part of the patch to disable in active mode > > (particularly in HWP and performance mode). > > > > But this patch is 2 years old. The folks who wanted this, disable > > intel-pstate and use userspace governor with acpi-cpufreq. So may be > > better to address those cases too. > > I disagree with "scale invariance is needed only by the schedutil governor"; > the two other users are the CPU's estimated utilization in the wakeup path, > via cpu_util_without(), as well as the load-balance path, via cpu_util() which > is used by update_sg_lb_stats(). > > Also remember that scale invariance is applied to both PELT signals util_avg > and load_avg; schedutil uses the former but not the latter. You have been quicker than me to reply. I was about to say the exact same things. scale invariance also helps the scheduler in task placement by stabilizing the metrics whatever the running frequency so a task will not be seen as a big task just because of a CPU running at lower frequency > > I understand Srinivas patch to disable MSR accesses during the tick as a > band-aid solution to address a specific use case he cares about, but I don't > think that extending this approach to any non-schedutil governor is a good > idea -- you'd be killing load balancing in the process. > > > Giovanni ^ permalink raw reply [flat|nested] 25+ messages in thread
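As a toy illustration of the distortion Vincent describes (illustrative only; PELT's real accumulation uses geometrically decaying sums, and the names here are invented): without invariance, a task that needs 1 ms of CPU at full speed runs for 2 ms at half frequency and looks twice as heavy to the load balancer. Weighting each slice of running time by the current frequency scale removes that bias.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

/*
 * Weight a slice of running time by the CPU's current frequency
 * scale (1024 = running at maximum frequency), so the contribution
 * to util_avg/load_avg reflects work done, not wall-clock runtime.
 */
static unsigned long invariant_delta(unsigned long running_us,
				     unsigned long freq_scale)
{
	return running_us * freq_scale / SCHED_CAPACITY_SCALE;
}
```

The same 1 ms of work contributes 1000 us whether it ran 2000 us at scale 512 or 1000 us at scale 1024, which is exactly the stability in task placement that the thread argues for.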
* Re: [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting 2019-10-04 8:28 ` Vincent Guittot @ 2019-10-04 8:33 ` Rafael J. Wysocki 0 siblings, 0 replies; 25+ messages in thread From: Rafael J. Wysocki @ 2019-10-04 8:33 UTC (permalink / raw) To: Vincent Guittot Cc: Giovanni Gherdovich, Srinivas Pandruvada, Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, the arch/x86 maintainers, open list:THERMAL, linux-kernel, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Quentin Perret, Dietmar Eggemann, Doug Smythies On Fri, Oct 4, 2019 at 10:28 AM Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Fri, 4 Oct 2019 at 10:24, Giovanni Gherdovich <ggherdovich@suse.cz> wrote: > > > > On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote: > > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote: > > > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich > > > > wrote: > > > > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> > > > > > > > > > > intel_pstate has two operating modes: active and passive. In "active" > > > > > mode, the in-built scaling governor is used and in "passive" mode, the > > > > > driver can be used with any governor like "schedutil". In "active" mode > > > > > the utilization values from schedutil is not used and there is a > > > > > requirement from high performance computing use cases, not to readas > > > > > well any APERF/MPERF MSRs. > > > > > > > > Well, this isn't quite convincing. > > > > > > > > In particular, I don't see why the "don't read APERF/MPERF MSRs" argument > > > > applies *only* to intel_pstate in the "active" mode. What about > > > > intel_pstate in the "passive" mode combined with the "performance" > > > > governor? Or any other governor different from "schedutil" for that > > > > matter? > > > > > > > > And what about acpi_cpufreq combined with any governor different from > > > > "schedutil"? 
> > > >
> > > > Scale invariance is not really needed in all of those cases right now
> > > > AFAICS, or is it?
> > >
> > > Correct. This is just part of the patch to disable in active mode
> > > (particularly in HWP and performance mode).
> > >
> > > But this patch is 2 years old. The folks who wanted this, disable
> > > intel-pstate and use userspace governor with acpi-cpufreq. So may be
> > > better to address those cases too.
> >
> > I disagree with "scale invariance is needed only by the schedutil governor";
> > the two other users are the CPU's estimated utilization in the wakeup path,
> > via cpu_util_without(), as well as the load-balance path, via cpu_util()
> > which is used by update_sg_lb_stats().
> >
> > Also remember that scale invariance is applied to both PELT signals util_avg
> > and load_avg; schedutil uses the former but not the latter.
>
> You have been quicker than me to reply. I was about to say the exact
> same things.
> scale invariance also helps the scheduler in task placement by
> stabilizing the metrics whatever the running frequency so a task will
> not be seen as a big task just because of a CPU running at lower
> frequency

So avoiding it just in one specific driver/governor configuration would be
inconsistent at best.

I guess that leaves us with the impact reduction option, realistically.
* Re: [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting 2019-10-04 8:29 ` Giovanni Gherdovich 2019-10-04 8:28 ` Vincent Guittot @ 2019-10-04 8:29 ` Rafael J. Wysocki 2019-10-04 8:57 ` Giovanni Gherdovich 1 sibling, 1 reply; 25+ messages in thread From: Rafael J. Wysocki @ 2019-10-04 8:29 UTC (permalink / raw) To: Giovanni Gherdovich Cc: Srinivas Pandruvada, Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Fri, Oct 4, 2019 at 10:24 AM Giovanni Gherdovich <ggherdovich@suse.cz> wrote: > > On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote: > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote: > > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich > > > wrote: > > > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> > > > > > > > > intel_pstate has two operating modes: active and passive. In "active" > > > > mode, the in-built scaling governor is used and in "passive" mode, the > > > > driver can be used with any governor like "schedutil". In "active" mode > > > > the utilization values from schedutil is not used and there is a > > > > requirement from high performance computing use cases, not to readas > > > > well any APERF/MPERF MSRs. > > > > > > Well, this isn't quite convincing. > > > > > > In particular, I don't see why the "don't read APERF/MPERF MSRs" argument > > > applies *only* to intel_pstate in the "active" mode. What about > > > intel_pstate in the "passive" mode combined with the "performance" > > > governor? Or any other governor different from "schedutil" for that > > > matter? > > > > > > And what about acpi_cpufreq combined with any governor different from > > > "schedutil"? 
> > >
> > > Scale invariance is not really needed in all of those cases right now
> > > AFAICS, or is it?
> >
> > Correct. This is just part of the patch to disable in active mode
> > (particularly in HWP and performance mode).
> >
> > But this patch is 2 years old. The folks who wanted this, disable
> > intel-pstate and use userspace governor with acpi-cpufreq. So may be
> > better to address those cases too.
>
> I disagree with "scale invariance is needed only by the schedutil governor";
> the two other users are the CPU's estimated utilization in the wakeup path,
> via cpu_util_without(), as well as the load-balance path, via cpu_util()
> which is used by update_sg_lb_stats().

OK, so there are reasons to run the scale invariance code which are
not related to the cpufreq governor in use.

I wonder then why those reasons are not relevant for intel_pstate in
the "active" mode.

> Also remember that scale invariance is applied to both PELT signals util_avg
> and load_avg; schedutil uses the former but not the latter.
>
> I understand Srinivas patch to disable MSR accesses during the tick as a
> band-aid solution to address a specific use case he cares about, but I don't
> think that extending this approach to any non-schedutil governor is a good
> idea -- you'd be killing load balancing in the process.

But that is also the case for intel_pstate in the "active" mode, isn't it?
* Re: [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting 2019-10-04 8:29 ` Rafael J. Wysocki @ 2019-10-04 8:57 ` Giovanni Gherdovich 2019-10-04 9:17 ` Rafael J. Wysocki 2019-10-04 15:17 ` Srinivas Pandruvada 0 siblings, 2 replies; 25+ messages in thread From: Giovanni Gherdovich @ 2019-10-04 8:57 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Srinivas Pandruvada, Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Fri, 2019-10-04 at 10:29 +0200, Rafael J. Wysocki wrote: > On Fri, Oct 4, 2019 at 10:24 AM Giovanni Gherdovich <ggherdovich@suse.cz> wrote: > > > > On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote: > > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote: > > > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich > > > > wrote: > > > > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> > > > > > > > > > > intel_pstate has two operating modes: active and passive. In "active" > > > > > mode, the in-built scaling governor is used and in "passive" mode, the > > > > > driver can be used with any governor like "schedutil". In "active" mode > > > > > the utilization values from schedutil is not used and there is a > > > > > requirement from high performance computing use cases, not to readas > > > > > well any APERF/MPERF MSRs. > > > > > > > > Well, this isn't quite convincing. > > > > > > > > In particular, I don't see why the "don't read APERF/MPERF MSRs" argument > > > > applies *only* to intel_pstate in the "active" mode. What about > > > > intel_pstate in the "passive" mode combined with the "performance" > > > > governor? Or any other governor different from "schedutil" for that > > > > matter? 
> > > >
> > > > And what about acpi_cpufreq combined with any governor different from
> > > > "schedutil"?
> > > >
> > > > Scale invariance is not really needed in all of those cases right now
> > > > AFAICS, or is it?
> > >
> > > Correct. This is just part of the patch to disable in active mode
> > > (particularly in HWP and performance mode).
> > >
> > > But this patch is 2 years old. The folks who wanted this, disable
> > > intel-pstate and use userspace governor with acpi-cpufreq. So may be
> > > better to address those cases too.
> >
> > I disagree with "scale invariance is needed only by the schedutil governor";
> > the two other users are the CPU's estimated utilization in the wakeup path,
> > via cpu_util_without(), as well as the load-balance path, via cpu_util()
> > which is used by update_sg_lb_stats().
>
> OK, so there are reasons to run the scale invariance code which are
> not related to the cpufreq governor in use.
>
> I wonder then why those reasons are not relevant for intel_pstate in
> the "active" mode.
>
> > Also remember that scale invariance is applied to both PELT signals util_avg
> > and load_avg; schedutil uses the former but not the latter.
> >
> > I understand Srinivas patch to disable MSR accesses during the tick as a
> > band-aid solution to address a specific use case he cares about, but I don't
> > think that extending this approach to any non-schedutil governor is a good
> > idea -- you'd be killing load balancing in the process.
>
> But that is also the case for intel_pstate in the "active" mode, isn't it?

Sure it is.

Now, what's the performance impact of losing scale-invariance in PELT signals?
And what's the performance impact of accessing two MSRs at the scheduler tick
on each CPU?

I am supporting Srinivas' patch because he expressed the concern that the
losses don't justify the gains for a specific class of users (supercomputing),
although I don't fully like the idea (and arguably that should be measured).
Giovanni
* Re: [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting 2019-10-04 8:57 ` Giovanni Gherdovich @ 2019-10-04 9:17 ` Rafael J. Wysocki 2019-10-04 15:17 ` Srinivas Pandruvada 1 sibling, 0 replies; 25+ messages in thread From: Rafael J. Wysocki @ 2019-10-04 9:17 UTC (permalink / raw) To: Giovanni Gherdovich Cc: Rafael J. Wysocki, Srinivas Pandruvada, Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Fri, Oct 4, 2019 at 10:52 AM Giovanni Gherdovich <ggherdovich@suse.cz> wrote: > > On Fri, 2019-10-04 at 10:29 +0200, Rafael J. Wysocki wrote: > > On Fri, Oct 4, 2019 at 10:24 AM Giovanni Gherdovich <ggherdovich@suse.cz> wrote: > > > > > > On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote: > > > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote: > > > > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich > > > > > wrote: > > > > > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> > > > > > > > > > > > > intel_pstate has two operating modes: active and passive. In "active" > > > > > > mode, the in-built scaling governor is used and in "passive" mode, the > > > > > > driver can be used with any governor like "schedutil". In "active" mode > > > > > > the utilization values from schedutil is not used and there is a > > > > > > requirement from high performance computing use cases, not to readas > > > > > > well any APERF/MPERF MSRs. > > > > > > > > > > Well, this isn't quite convincing. > > > > > > > > > > In particular, I don't see why the "don't read APERF/MPERF MSRs" argument > > > > > applies *only* to intel_pstate in the "active" mode. What about > > > > > intel_pstate in the "passive" mode combined with the "performance" > > > > > governor? 
Or any other governor different from "schedutil" for that > > > > > matter? > > > > > > > > > > And what about acpi_cpufreq combined with any governor different from > > > > > "schedutil"? > > > > > > > > > > Scale invariance is not really needed in all of those cases right now > > > > > AFAICS, or is it? > > > > > > > > Correct. This is just part of the patch to disable in active mode > > > > (particularly in HWP and performance mode). > > > > > > > > But this patch is 2 years old. The folks who wanted this, disable > > > > intel-pstate and use userspace governor with acpi-cpufreq. So may be > > > > better to address those cases too. > > > > > > I disagree with "scale invariance is needed only by the schedutil governor"; > > > the two other users are the CPU's estimated utilization in the wakeup path, > > > via cpu_util_without(), as well as the load-balance path, via cpu_util() which > > > is used by update_sg_lb_stats(). > > > > OK, so there are reasons to run the scale invariance code which are > > not related to the cpufreq governor in use. > > > > I wonder then why those reasons are not relevant for intel_pstate in > > the "active" mode. > > > > > Also remember that scale invariance is applied to both PELT signals util_avg > > > and load_avg; schedutil uses the former but not the latter. > > > > > > I understand Srinivas patch to disable MSR accesses during the tick as a > > > band-aid solution to address a specific use case he cares about, but I don't > > > think that extending this approach to any non-schedutil governor is a good > > > idea -- you'd be killing load balancing in the process. > > > > But that is also the case for intel_pstate in the "active" mode, isn't it? > > Sure it is. > > Now, what's the performance impact of loosing scale-invariance in PELT signals? That needs to be measured. > And what's the performance impact of accessing two MSRs at the scheduler tick > on each CPU? 
That would be the MSR access latency times two, and I don't remember the exact
numbers off the top of my head. It would also depend on how much time it takes
to run the tick without those two MSR accesses, on average.

The question I have, however, is whether or not it really is necessary to
update arch_cpu_freq on every tick. Maybe it would be sufficient to do that
every 10 ms, say (in case the tick is more frequent than that), or similar?

> I am sporting Srinivas' patch because he expressed the concern that the
> losses don't justify the gains for a specific class of users
> (supercomputing), although I don't fully like the idea (and arguably that
> should be measured).

My point is that this patch doesn't even cover the entire case in question,
because the HPC people may very well be using cpufreq driver/governor
configurations different from intel_pstate in the "active" mode.

Moreover, given the lack of data, it is even hard to say what the potential
impact is, if any.

I guess it should be a fairly straightforward exercise to compare the results
of the various benchmarks using intel_pstate in the "passive" mode and the
"performance" governor with and without patch [1/2] from this series. If you
see any perf regressions after applying the patch, that's what the others
will probably see as well.
* Re: [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting 2019-10-04 8:57 ` Giovanni Gherdovich 2019-10-04 9:17 ` Rafael J. Wysocki @ 2019-10-04 15:17 ` Srinivas Pandruvada 2019-10-07 8:33 ` Giovanni Gherdovich 1 sibling, 1 reply; 25+ messages in thread From: Srinivas Pandruvada @ 2019-10-04 15:17 UTC (permalink / raw) To: Giovanni Gherdovich, Rafael J. Wysocki Cc: Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Fri, 2019-10-04 at 10:57 +0200, Giovanni Gherdovich wrote: > On Fri, 2019-10-04 at 10:29 +0200, Rafael J. Wysocki wrote: > > On Fri, Oct 4, 2019 at 10:24 AM Giovanni Gherdovich < > > ggherdovich@suse.cz> wrote: > > > > > > On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote: > > > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote: > > > > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni > > > > > Gherdovich > > > > > wrote: > > > > > > From: Srinivas Pandruvada < > > > > > > srinivas.pandruvada@linux.intel.com> > > > > > > > > > > > > intel_pstate has two operating modes: active and passive. > > > > > > In "active" > > > > > > mode, the in-built scaling governor is used and in > > > > > > "passive" mode, the > > > > > > driver can be used with any governor like "schedutil". In > > > > > > "active" mode > > > > > > the utilization values from schedutil is not used and there > > > > > > is a > > > > > > requirement from high performance computing use cases, not > > > > > > to readas > > > > > > well any APERF/MPERF MSRs. > > > > > > > > > > Well, this isn't quite convincing. > > > > > > > > > > In particular, I don't see why the "don't read APERF/MPERF > > > > > MSRs" argument > > > > > applies *only* to intel_pstate in the "active" mode. 
What > > > > > about > > > > > intel_pstate in the "passive" mode combined with the > > > > > "performance" > > > > > governor? Or any other governor different from "schedutil" > > > > > for that > > > > > matter? > > > > > > > > > > And what about acpi_cpufreq combined with any governor > > > > > different from > > > > > "schedutil"? > > > > > > > > > > Scale invariance is not really needed in all of those cases > > > > > right now > > > > > AFAICS, or is it? > > > > > > > > Correct. This is just part of the patch to disable in active > > > > mode > > > > (particularly in HWP and performance mode). > > > > > > > > But this patch is 2 years old. The folks who wanted this, > > > > disable > > > > intel-pstate and use userspace governor with acpi-cpufreq. So > > > > may be > > > > better to address those cases too. > > > > > > I disagree with "scale invariance is needed only by the schedutil > > > governor"; > > > the two other users are the CPU's estimated utilization in the > > > wakeup path, > > > via cpu_util_without(), as well as the load-balance path, via > > > cpu_util() which > > > is used by update_sg_lb_stats(). > > > > OK, so there are reasons to run the scale invariance code which are > > not related to the cpufreq governor in use. > > > > I wonder then why those reasons are not relevant for intel_pstate > > in > > the "active" mode. > > > > > Also remember that scale invariance is applied to both PELT > > > signals util_avg > > > and load_avg; schedutil uses the former but not the latter. > > > > > > I understand Srinivas patch to disable MSR accesses during the > > > tick as a > > > band-aid solution to address a specific use case he cares about, > > > but I don't > > > think that extending this approach to any non-schedutil governor > > > is a good > > > idea -- you'd be killing load balancing in the process. > > > > But that is also the case for intel_pstate in the "active" mode, > > isn't it? > > Sure it is. 
>
> Now, what's the performance impact of loosing scale-invariance in
> PELT signals?
> And what's the performance impact of accessing two MSRs at the
> scheduler tick on each CPU?
>
> I am sporting Srinivas' patch because he expressed the concern that
> the losses don't justify the gains for a specific class of users
> (supercomputing), although I don't fully like the idea (and arguably
> that should be measured).
>

I understand there are other impacts of the scale invariance, like in the
deadline code, which I didn't see when I submitted this patch. You can drop
this patch at this time if you like. I can poke HPC folks to test a released
kernel.

Thanks,
Srinivas
* Re: [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting 2019-10-04 15:17 ` Srinivas Pandruvada @ 2019-10-07 8:33 ` Giovanni Gherdovich 0 siblings, 0 replies; 25+ messages in thread From: Giovanni Gherdovich @ 2019-10-07 8:33 UTC (permalink / raw) To: Srinivas Pandruvada, Rafael J. Wysocki Cc: Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies On Fri, 2019-10-04 at 08:17 -0700, Srinivas Pandruvada wrote: > On Fri, 2019-10-04 at 10:57 +0200, Giovanni Gherdovich wrote: > > On Fri, 2019-10-04 at 10:29 +0200, Rafael J. Wysocki wrote: > > > On Fri, Oct 4, 2019 at 10:24 AM Giovanni Gherdovich < > > > ggherdovich@suse.cz> wrote: > > > > > > > > On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote: > > > > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote: > > > > > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich > > > > > > wrote: > > > > > > > From: Srinivas Pandruvada < srinivas.pandruvada@linux.intel.com> > > > > > > > > > > > > > > intel_pstate has two operating modes: active and passive. In > > > > > > > "active" mode, the in-built scaling governor is used and in > > > > > > > "passive" mode, the driver can be used with any governor like > > > > > > > "schedutil". In "active" mode the utilization values from > > > > > > > schedutil is not used and there is a requirement from high > > > > > > > performance computing use cases, not to readas well any > > > > > > > APERF/MPERF MSRs. > > > > > > > > > > > > Well, this isn't quite convincing. > > > > > > > > > > > > In particular, I don't see why the "don't read APERF/MPERF MSRs" > > > > > > argument applies *only* to intel_pstate in the "active" mode. 
> > > > > > What about intel_pstate in the "passive" mode combined with the > > > > > > "performance" governor? Or any other governor different from > > > > > > "schedutil" for that matter? > > > > > > > > > > > > And what about acpi_cpufreq combined with any governor different > > > > > > from "schedutil"? > > > > > > > > > > > > Scale invariance is not really needed in all of those cases right > > > > > > now AFAICS, or is it? > > > > > > > > > > Correct. This is just part of the patch to disable in active mode > > > > > (particularly in HWP and performance mode). > > > > > > > > > > But this patch is 2 years old. The folks who wanted this, disable > > > > > intel-pstate and use userspace governor with acpi-cpufreq. So may be > > > > > better to address those cases too. > > > > > > > > I disagree with "scale invariance is needed only by the schedutil > > > > governor"; the two other users are the CPU's estimated utilization in > > > > the wakeup path, via cpu_util_without(), as well as the load-balance > > > > path, via cpu_util() which is used by update_sg_lb_stats(). > > > > > > OK, so there are reasons to run the scale invariance code which are > > > not related to the cpufreq governor in use. > > > > > > I wonder then why those reasons are not relevant for intel_pstate in the > > > "active" mode. > > > > > > > Also remember that scale invariance is applied to both PELT signals > > > > util_avg and load_avg; schedutil uses the former but not the latter. > > > > > > > > I understand Srinivas patch to disable MSR accesses during the tick as > > > > a band-aid solution to address a specific use case he cares about, but > > > > I don't think that extending this approach to any non-schedutil > > > > governor is a good idea -- you'd be killing load balancing in the > > > > process. > > > > > > But that is also the case for intel_pstate in the "active" mode, > > > isn't it? > > > > Sure it is. 
> >
> > Now, what's the performance impact of loosing scale-invariance in PELT
> > signals? And what's the performance impact of accessing two MSRs at the
> > scheduler tick on each CPU?
> >
> > I am sporting Srinivas' patch because he expressed the concern that the
> > losses don't justify the gains for a specific class of users
> > (supercomputing), although I don't fully like the idea (and arguably that
> > should be measured).
>
> I understand there are other impact of the scale invariance like in
> deadline code, which I didn't see when I submitted this patch.
> You can drop this patch at this time if you like. I can poke HPC folks
> to test a released kernel.

Thanks Srinivas, in v3 I'll drop the tick_disable mechanism for now.

Giovanni
end of thread, other threads:[~2019-10-08 9:32 UTC | newest]

Thread overview: 25+ messages
2019-10-02 12:29 [PATCH v2 0/2] Add support for frequency invariance for (some) x86 Giovanni Gherdovich
2019-10-02 12:29 ` [PATCH v2 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich
2019-10-02 15:23   ` kbuild test robot
2019-10-02 15:49     ` Giovanni Gherdovich
2019-10-02 16:43   ` kbuild test robot
2019-10-02 18:27   ` Peter Zijlstra
2019-10-03 10:27     ` Rafael J. Wysocki
2019-10-03 12:15       ` Peter Zijlstra
2019-10-03 17:36         ` Srinivas Pandruvada
2019-10-03 17:53           ` Rafael J. Wysocki
2019-10-04 11:48         ` Peter Zijlstra
2019-10-08  7:48           ` Giovanni Gherdovich
2019-10-08  9:32             ` Rafael J. Wysocki
2019-10-02 12:29 ` [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting Giovanni Gherdovich
2019-10-03 18:05   ` Rafael J. Wysocki
2019-10-04  3:31     ` Srinivas Pandruvada
2019-10-04  8:08       ` Rafael J. Wysocki
2019-10-04  8:29         ` Giovanni Gherdovich
2019-10-04  8:28           ` Vincent Guittot
2019-10-04  8:33             ` Rafael J. Wysocki
2019-10-04  8:29           ` Rafael J. Wysocki
2019-10-04  8:57             ` Giovanni Gherdovich
2019-10-04  9:17               ` Rafael J. Wysocki
2019-10-04 15:17               ` Srinivas Pandruvada
2019-10-07  8:33                 ` Giovanni Gherdovich