* [PATCH 0/2] Add support for frequency invariance for (some) x86
@ 2019-09-09  2:42 Giovanni Gherdovich

From: Giovanni Gherdovich
To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw
Cc: x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt,
    vincent.guittot, qperret, dietmar.eggemann, Giovanni Gherdovich

This is a resend of Peter Zijlstra's patch to support frequency
scale-invariance on x86 from May 2018 [see 1]. I've added some modifications
and included performance test results. If Peter doesn't mind, I'm slapping my
name on it :)

The changes from Peter's original implementation are:

1) normalizing against the 4-cores turbo level instead of 1-core turbo
2) removing the run-time search for when the above value isn't found in the
   various Intel MSRs -- the base frequency value is taken in that case.

The section "4. KNOWN LIMITATIONS" in the first patch commit message addresses
the reason why this approach was dropped back in 2018, and explains that the
performance gains outweigh that issue.

The second patch from Srinivas is taken verbatim from the May 2018 submission
as it still applies.

I apologize for the length of patch #1's commit message; I've made a table of
contents with summaries of each section that should make it easier to skim
through the content.

This submission incorporates the feedback and requests for additional tests
received during the presentation made at OSPM 2019 in Pisa three months ago.
[1] https://lore.kernel.org/lkml/20180516044911.28797-2-srinivas.pandruvada@linux.intel.com/

Giovanni Gherdovich (1):
  x86,sched: Add support for frequency invariance

Srinivas Pandruvada (1):
  cpufreq: intel_pstate: Conditional frequency invariant accounting

 arch/x86/include/asm/topology.h |  29 +++++++
 arch/x86/kernel/smpboot.c       | 180 +++++++++++++++++++++++++++++++++++++++-
 drivers/cpufreq/intel_pstate.c  |   5 ++
 kernel/sched/core.c             |   1 +
 kernel/sched/sched.h            |   7 ++
 5 files changed, 221 insertions(+), 1 deletion(-)

--
2.16.4
* [PATCH 1/2] x86,sched: Add support for frequency invariance
  2019-09-09  2:42 ` Giovanni Gherdovich

From: Giovanni Gherdovich
To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw
Cc: x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt,
    vincent.guittot, qperret, dietmar.eggemann, Giovanni Gherdovich

Implement arch_scale_freq_capacity() for 'modern' x86. This function is used
by the scheduler to correctly account usage in the face of DVFS.

The present patch addresses Intel processors specifically and has positive
performance and performance-per-watt implications for the schedutil cpufreq
governor, bringing it closer to, if not on par with, the powersave governor
from the intel_pstate driver/framework.

Large performance gains are obtained when the machine is lightly loaded and
no regressions are observed at saturation. The benchmarks with the largest
gains are kernel compilation, tbench (the networking version of dbench) and
shell-intensive workloads.

1. FREQUENCY INVARIANCE: MOTIVATION
   * Without it, a task looks larger if the CPU runs slower

2. PECULIARITIES OF X86
   * freq invariance accounting requires knowing the ratio freq_curr/freq_max
   2.1 CURRENT FREQUENCY
       * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz")
   2.2 MAX FREQUENCY
       * It varies with time (turbo). As an approximation, we set it to a
         constant, i.e. 4-cores turbo frequency (or base frequency if nothing
         else is reported by MSRs)

3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
   * The invariant schedutil formula has no feedback loop and reacts faster
     to utilization changes

4. KNOWN LIMITATIONS
   * In some cases tasks can't reach max util despite how hard they try

5. PERFORMANCE TESTING
   5.1 MACHINES
       * Skylake, Broadwell, Haswell
   5.2 SETUP
       * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max =
         1-2-3-4-8-12 active cores turbo w/ invariant schedutil, and
         intel_pstate/powersave
   5.3 BENCHMARK RESULTS
       5.3.1 NEUTRAL BENCHMARKS
             * NAS Parallel Benchmarks (HPC), hackbench
       5.3.2 NON-NEUTRAL BENCHMARKS
             * tbench (10-30% better), kernbench (10-15% better),
               shell-intensive-scripts (30-50% better)
             * no regressions
       5.3.3 SELECTION OF DETAILED RESULTS
       5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
             * dbench (5% worse on one machine), kernbench (3% worse),
               tbench (5-10% better), shell-intensive-scripts (10-40% better)

6. MICROARCH'ES ADDRESSED HERE
   * Xeon Core before the Scalable Performance processors line (Xeon
     Gold/Platinum etc have different MSRs semantic for querying turbo levels)

7. REFERENCES
   * MMTests performance testing framework, github.com/gormanm/mmtests

+-------------------------------------------------------------------------+
| 1. FREQUENCY INVARIANCE: MOTIVATION
+-------------------------------------------------------------------------+

For example; suppose a CPU has two frequencies: 500 and 1000 MHz. When
running a task that would consume 1/3rd of a CPU at 1000 MHz, it would appear
to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the false
impression this CPU is almost at capacity, even though it can go faster [*].
In a nutshell, without frequency scale-invariance tasks look larger just
because the CPU is running slower.

[*] (footnote: this assumes a linear frequency/performance relation; which
everybody knows to be false, but given realities it's the best approximation
we can make.)

+-------------------------------------------------------------------------+
| 2. 
PECULIARITIES OF X86
+-------------------------------------------------------------------------+

Accounting for frequency changes in PELT signals requires the computation of
the ratio freq_curr / freq_max. On x86 neither of those terms is readily
available.

2.1 CURRENT FREQUENCY
=====================

Since modern x86 has hardware control over the actual frequency we run at
(because, amongst other things, of Turbo-Mode), we cannot simply use the
frequency as requested through cpufreq. Instead we use the APERF/MPERF MSRs
to compute the effective frequency over the recent past. Also, because
reading MSRs is expensive, don't do so every time we need the value, but
amortize the cost by doing it every tick.

2.2 MAX FREQUENCY
=================

Obtaining freq_max is also non-trivial because at any time the hardware can
provide a frequency boost to a selected subset of cores if the package has
enough power to spare (eg: Turbo Boost). This means that the maximum
frequency available to a given core changes with time.

The approach taken in this change is to arbitrarily set freq_max to a
constant value at boot. The value chosen is the "4-cores (4C) turbo
frequency" on most microarchitectures, after evaluating the following
candidates:

    * 1-core (1C) turbo frequency (the fastest turbo state available)
    * around base frequency (a.k.a. max P-state)
    * something in between, such as 4C turbo

To interpret these options, consider that this is the denominator in
freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
util_avg and load_avg. A large denominator will undershoot (util_avg looks a
bit smaller than it really is); vice versa, with a smaller denominator PELT
signals will tend to overshoot.
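Putting 2.1 and 2.2 together, the per-tick update reduces to a handful of
integer operations. As a rough user-space model (the names freq_scale and
max_freq_scale are ours, not the patch's exact symbols; max_freq_scale stands
for freq_max/freq_base scaled by SCHED_CAPACITY_SCALE, as computed at boot):

```c
#include <stdint.h>
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

/*
 * Model of the per-tick ratio update: given the APERF/MPERF deltas since
 * the last tick and the precomputed max_freq_scale (freq_max/freq_base *
 * SCHED_CAPACITY_SCALE), return freq_curr/freq_max scaled to
 * SCHED_CAPACITY_SCALE.
 */
static uint64_t freq_scale(uint64_t delta_aperf, uint64_t delta_mperf,
			   uint64_t max_freq_scale)
{
	uint64_t scale;

	/* no samples (e.g. fully idle tick): fall back to full capacity */
	if (delta_mperf == 0 || max_freq_scale == 0)
		return SCHED_CAPACITY_SCALE;

	/* BusyMHz relative to freq_base: delta_APERF/delta_MPERF * 1024 */
	scale = (delta_aperf * SCHED_CAPACITY_SCALE) / delta_mperf;

	/* divide by freq_max/freq_base to obtain freq_curr/freq_max */
	scale = (scale * SCHED_CAPACITY_SCALE) / max_freq_scale;

	/* freq_max is only an estimate; clip when freq_curr exceeds it */
	if (scale > SCHED_CAPACITY_SCALE)
		scale = SCHED_CAPACITY_SCALE;

	return scale;
}
```

For instance, with max_freq_scale = 2048 (freq_max twice the base frequency),
a tick spent entirely at base frequency (delta_aperf == delta_mperf) yields a
scale of 512, i.e. freq_curr/freq_max = 1/2.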
Given that PELT drives frequency selection in the schedutil governor, we will
have:

    freq_max set to   | effect on DVFS
    ------------------+------------------
    1C turbo          | power efficiency (lower freq choices)
    base freq         | performance (higher util_avg, higher freq requests)
    4C turbo          | a bit of both

4C turbo proves to be a good compromise in a number of benchmarks (see
below).

Note that when the function intel_set_cpu_max_freq() fails to query the
various MSRs for the 4C turbo value, the variable arch_max_freq retains its
default value of SCHED_CAPACITY_SCALE (1024), which corresponds to setting
freq_max to the base frequency wrt the table above.

+-------------------------------------------------------------------------+
| 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
+-------------------------------------------------------------------------+

Once an architecture implements a frequency scale-invariant utilization (the
PELT signal util_avg), schedutil switches its frequency selection formula
from

    freq_next = 1.25 * freq_curr * util    [non-invariant util signal]

to

    freq_next = 1.25 * freq_max * util     [invariant util signal]

where, in the second formula, freq_max is set to the 1C turbo frequency (max
turbo). The advantage of the second formula, whose usage we unlock with this
patch, is that freq_next doesn't depend on the current frequency in an
iterative fashion, but can jump to any frequency in a single update. This
absence of feedback in the formula makes it quicker to react to utilization
changes and more robust against pathological instabilities.

Compare it to the update formula of intel_pstate/powersave:

    freq_next = 1.25 * freq_max * Busy%

where again freq_max is 1C turbo and Busy% is the percentage of time not
spent idling (calculated with delta_MPERF / delta_TSC); essentially the same
as invariant schedutil, and largely responsible for intel_pstate/powersave's
good reputation.
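The two selection formulas above can be put side by side in code. This is an
illustrative sketch only (the function names are ours, not schedutil's
internals); frequencies are in kHz and util is in [0, SCHED_CAPACITY_SCALE]:

```c
#include <stdint.h>
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

/* non-invariant: the request chases the target iteratively from freq_curr */
static uint64_t freq_next_raw(uint64_t freq_curr, uint64_t util)
{
	/* 1.25 * freq_curr * util (f + f/4 == 1.25f, integer arithmetic) */
	return (freq_curr + (freq_curr >> 2)) * util / SCHED_CAPACITY_SCALE;
}

/* invariant: the request can jump anywhere in one update, vs 1C turbo */
static uint64_t freq_next_inv(uint64_t freq_1c_turbo, uint64_t util)
{
	/* 1.25 * freq_max * util */
	return (freq_1c_turbo + (freq_1c_turbo >> 2)) * util
		/ SCHED_CAPACITY_SCALE;
}
```

With the first formula a CPU stuck at a low freq_curr can raise its request
by at most 25% per update; with the second, a steady util maps straight to
the needed frequency in a single step.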
The non-invariant schedutil formula is derived from the invariant one by
approximating util_inv with util_raw * freq_curr / freq_max, but this has
limitations.

Testing shows improved performance due to better frequency selections when
the machine is lightly loaded, and essentially no change in behaviour at
saturation / overutilization.

+-------------------------------------------------------------------------+
| 4. KNOWN LIMITATIONS
+-------------------------------------------------------------------------+

It's been shown that it is possible to create pathological scenarios where a
CPU-bound task cannot reach max utilization, if the normalizing factor
freq_max is fixed to a constant value (see [Lelli-2018]).

If freq_max is set to 4C turbo as we do here, one needs to peg at least 5
cores in a package doing some busywork, and observe that none of those tasks
will ever reach max util (1024) because they're all running at less than the
4C turbo frequency.

While this concern still applies, we believe the performance benefit of
frequency scale-invariant PELT signals outweighs the cost of this
limitation.

[Lelli-2018]
https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/

+-------------------------------------------------------------------------+
| 5. PERFORMANCE TESTING
+-------------------------------------------------------------------------+

5.1 MACHINES
============

We tested the patch on three machines, with Skylake, Broadwell and Haswell
CPUs. The details are below, together with the available turbo ratios as
reported by the appropriate MSRs.
* 8x-SKYLAKE-UMA:
  Single socket E3-1240 v5, Skylake 4 cores/8 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC  800 |********
    BASE  3500 |***********************************
    4C    3700 |*************************************
    3C    3800 |**************************************
    2C    3900 |***************************************
    1C    3900 |***************************************

* 80x-BROADWELL-NUMA:
  Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC 1200 |************
    BASE  2200 |**********************
    8C    2900 |*****************************
    7C    3000 |******************************
    6C    3100 |*******************************
    5C    3200 |********************************
    4C    3300 |*********************************
    3C    3400 |**********************************
    2C    3600 |************************************
    1C    3600 |************************************

* 48x-HASWELL-NUMA:
  Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC 1200 |************
    BASE  2300 |***********************
    12C   2600 |**************************
    11C   2600 |**************************
    10C   2600 |**************************
    9C    2600 |**************************
    8C    2600 |**************************
    7C    2600 |**************************
    6C    2600 |**************************
    5C    2700 |***************************
    4C    2800 |****************************
    3C    2900 |*****************************
    2C    3100 |*******************************
    1C    3100 |*******************************

5.2 SETUP
=========

* The baseline is Linux v5.2 with schedutil (non-invariant) and the
  intel_pstate driver in passive mode.
* The rationale for choosing the various freq_max values to test has been to
  try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical
  on all machines), plus one more value closer to base_freq but still in the
  turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA).

* In addition we've run all tests with intel_pstate/powersave for comparison.

* The filesystem is always XFS, the userspace is openSUSE Leap 15.1.

* 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
  with active intel_pstate on this machine use that.

This gives, in terms of combinations tested on each machine:

* 8x-SKYLAKE-UMA
  * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
  * intel_pstate active + powersave + HWP
  * invariant schedutil, freq_max = 1C turbo
  * invariant schedutil, freq_max = 3C turbo
  * invariant schedutil, freq_max = 4C turbo

* both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
  * [same as 8x-SKYLAKE-UMA, but not HWP capable]
  * invariant schedutil, freq_max = 8C turbo
  * (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores
    turbo")

5.3 BENCHMARK RESULTS
=====================

5.3.1 NEUTRAL BENCHMARKS
------------------------

Tests that didn't show any measurable difference in performance on any of the
test machines between non-invariant schedutil and our patch are:

* NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
  computational kernel
* flexible I/O (FIO)
* hackbench (using threads or processes, and using pipes or sockets)

5.3.2 NON-NEUTRAL BENCHMARKS
----------------------------

What follows are summary tables where each benchmark result is given a score.

* A tilde (~) means a neutral result, i.e. no difference from baseline.
* Scores are computed with the ratio result_new / result_baseline, so a tilde
  means a score of 1.00.
* The results in the score ratio are the geometric means of results running
  the benchmark with different parameters (eg: for kernbench: using 1, 2, 4, ... 
number of processes; for pgbench: varying the number of clients, and so on).
* The first three tables show higher-is-better kinds of tests (i.e. measured
  in operations/second), the subsequent three show lower-is-better kinds of
  tests (i.e. the workload is fixed and we measure elapsed time, think
  kernbench).
* "gitsource" is a name we made up for the test consisting in running the
  entire unit tests suite of the Git SCM and measuring how long it takes. We
  take it as a typical example of a shell-intensive serialized workload.
* In the "I_PSTATE" column we have the results for intel_pstate/powersave.
  Other columns show invariant schedutil for different values of freq_max.
  4C turbo is circled as it's the value we've chosen for the final
  implementation.

80x-BROADWELL-NUMA (comparison ratio; higher is better)
                                     +------+
              I_PSTATE   1C     3C   |  4C  |  8C
pgbench-ro      1.14     ~      ~    | 1.11 | 1.14
pgbench-rw       ~       ~      ~    |  ~   |  ~
netperf-udp     1.06     ~     1.06  | 1.05 | 1.07
netperf-tcp      ~      1.03    ~    | 1.01 | 1.02
tbench4         1.57    1.18   1.22  | 1.30 | 1.56
                                     +------+

8x-SKYLAKE-UMA (comparison ratio; higher is better)
                                     +------+
          I_PSTATE/HWP   1C     3C   |  4C  |
pgbench-ro       ~       ~      ~    |  ~   |
pgbench-rw       ~       ~      ~    |  ~   |
netperf-udp      ~       ~      ~    |  ~   |
netperf-tcp      ~       ~      ~    |  ~   |
tbench4         1.30    1.14   1.14  | 1.16 |
                                     +------+

48x-HASWELL-NUMA (comparison ratio; higher is better)
                                     +------+
              I_PSTATE   1C     3C   |  4C  |  12C
pgbench-ro      1.15     ~      ~    | 1.06 | 1.16
pgbench-rw       ~       ~      ~    |  ~   |  ~
netperf-udp     1.05    0.97   1.04  | 1.04 | 1.02
netperf-tcp     0.96    1.01   1.01  | 1.01 | 1.01
tbench4         1.50    1.05   1.13  | 1.13 | 1.25
                                     +------+

In the tables above we see that active intel_pstate is slightly better than
our 4C-turbo patch (both in reference to the baseline non-invariant
schedutil) on read-only pgbench and much better on tbench. Both cases are
notable in that they show how raising our freq_max (to 8C-turbo and
12C-turbo on 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps
invariant schedutil to get closer.
If we ignore active intel_pstate and focus on the comparison with baseline
alone, there are several instances of double-digit performance improvement.

80x-BROADWELL-NUMA (comparison ratio; lower is better)
                                     +------+
              I_PSTATE   1C     3C   |  4C  |  8C
dbench4         1.23    0.95   0.95  | 0.95 | 0.95
kernbench       0.93    0.83   0.83  | 0.83 | 0.82
gitsource       0.98    0.49   0.49  | 0.49 | 0.48
                                     +------+

8x-SKYLAKE-UMA (comparison ratio; lower is better)
                                     +------+
          I_PSTATE/HWP   1C     3C   |  4C  |
dbench4          ~       ~      ~    |  ~   |
kernbench        ~       ~      ~    |  ~   |
gitsource       0.92    0.55   0.55  | 0.55 |
                                     +------+

48x-HASWELL-NUMA (comparison ratio; lower is better)
                                     +------+
              I_PSTATE   1C     3C   |  4C  |  8C
dbench4          ~       ~      ~    |  ~   |  ~
kernbench       0.94    0.90   0.89  | 0.90 | 0.90
gitsource       0.97    0.69   0.69  | 0.69 | 0.69
                                     +------+

dbench is not very remarkable here, unless we notice how poorly active
intel_pstate is performing on 80x-BROADWELL-NUMA: a 23% regression versus
non-invariant schedutil. We repeated that run, getting consistent results.
Out of scope for the patch at hand, but deserving future investigation.
Other than that, we previously ran this campaign with Linux v5.0 and saw the
patch doing better on dbench at the time. We haven't checked closely and can
only speculate at this point.

On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see
in the detailed tables that the gains concentrate on low process counts
(lightly loaded machines).

The test we call "gitsource" (running the git unit test suite, a
long-running single-threaded shell script) appears rather spectacular in
this table (gains of 30-50% depending on the machine). It is to be noted,
however, that gitsource has no adjustable parameters (such as the number of
jobs in kernbench, which we average over in order to get a single-number
summary score) and is exactly the kind of low-parallelism workload that
benefits the most from this patch. When looking at the detailed tables of
kernbench or tbench4, at low process or client counts one can see similar
numbers.
5.3.3 SELECTION OF DETAILED RESULTS
-----------------------------------

Machine           : 48x-HASWELL-NUMA
Benchmark         : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter : number of clients
Unit              : MB/sec (higher is better)

               5.2.0 vanilla (BASELINE)       5.2.0 intel_pstate             5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1     126.73  +- 0.31%  (      )     315.91  +- 0.66%  ( 149.28%)   125.03  +- 0.76%  (  -1.34%)
Hmean  2     258.04  +- 0.62%  (      )     614.16  +- 0.51%  ( 138.01%)   269.58  +- 1.45%  (   4.47%)
Hmean  4     514.30  +- 0.67%  (      )    1146.58  +- 0.54%  ( 122.94%)   533.84  +- 1.99%  (   3.80%)
Hmean  8    1111.38  +- 2.52%  (      )    2159.78  +- 0.38%  (  94.33%)  1359.92  +- 1.56%  (  22.36%)
Hmean  16   2286.47  +- 1.36%  (      )    3338.29  +- 0.21%  (  46.00%)  2720.20  +- 0.52%  (  18.97%)
Hmean  32   4704.84  +- 0.35%  (      )    4759.03  +- 0.43%  (   1.15%)  4774.48  +- 0.30%  (   1.48%)
Hmean  64   7578.04  +- 0.27%  (      )    7533.70  +- 0.43%  (  -0.59%)  7462.17  +- 0.65%  (  -1.53%)
Hmean  128  6998.52  +- 0.16%  (      )    6987.59  +- 0.12%  (  -0.16%)  6909.17  +- 0.14%  (  -1.28%)
Hmean  192  6901.35  +- 0.25%  (      )    6913.16  +- 0.10%  (   0.17%)  6855.47  +- 0.21%  (  -0.66%)

               5.2.0 3C-turbo                 5.2.0 4C-turbo                 5.2.0 12C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1     128.43  +- 0.28%  (   1.34%)   130.64  +- 3.81%  (   3.09%)   153.71  +- 5.89%  (  21.30%)
Hmean  2     311.70  +- 6.15%  (  20.79%)   281.66  +- 3.40%  (   9.15%)   305.08  +- 5.70%  (  18.23%)
Hmean  4     641.98  +- 2.32%  (  24.83%)   623.88  +- 5.28%  (  21.31%)   906.84  +- 4.65%  (  76.32%)
Hmean  8    1633.31  +- 1.56%  (  46.96%)  1714.16  +- 0.93%  (  54.24%)  2095.74  +- 0.47%  (  88.57%)
Hmean  16   3047.24  +- 0.42%  (  33.27%)  3155.02  +- 0.30%  (  37.99%)  3634.58  +- 0.15%  (  58.96%)
Hmean  32   4734.31  +- 0.60%  (   0.63%)  4804.38  +- 0.23%  (   2.12%)  4674.62  +- 0.27%  (  -0.64%)
Hmean  64   7699.74  +- 0.35%  (   1.61%)  7499.72  +- 0.34%  (  -1.03%)  7659.03  +- 0.25%  (   1.07%)
Hmean  128  6935.18  +- 0.15%  (  -0.91%)  6942.54  +- 0.10%  (  -0.80%)  7004.85  +- 0.12%  (   0.09%)
Hmean  192  6901.62  +- 0.12%  (   0.00%)  6856.93  +- 0.10%  (  -0.64%)  6978.74  +- 0.10%  (   1.12%)

This is one of the cases where the patch still can't surpass active
intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise,
gains are visible up to 16 clients and the saturated scenario is the same as
baseline.

The scores in the summary table from the previous sections are ratios of
geometric means of the results over different client counts, as seen in this
table.

Machine           : 80x-BROADWELL-NUMA
Benchmark         : kernbench (kernel compilation)
Varying parameter : number of jobs
Unit              : seconds (lower is better)

               5.2.0 vanilla (BASELINE)      5.2.0 intel_pstate            5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2     379.68  +- 0.06%  (      )    330.20  +- 0.43%  (  13.03%)   285.93  +- 0.07%  (  24.69%)
Amean  4     200.15  +- 0.24%  (      )    175.89  +- 0.22%  (  12.12%)   153.78  +- 0.25%  (  23.17%)
Amean  8     106.20  +- 0.31%  (      )     95.54  +- 0.23%  (  10.03%)    86.74  +- 0.10%  (  18.32%)
Amean  16     56.96  +- 1.31%  (      )     53.25  +- 1.22%  (   6.50%)    48.34  +- 1.73%  (  15.13%)
Amean  32     34.80  +- 2.46%  (      )     33.81  +- 0.77%  (   2.83%)    30.28  +- 1.59%  (  12.99%)
Amean  64     26.11  +- 1.63%  (      )     25.04  +- 1.07%  (   4.10%)    22.41  +- 2.37%  (  14.16%)
Amean  128    24.80  +- 1.36%  (      )     23.57  +- 1.23%  (   4.93%)    21.44  +- 1.37%  (  13.55%)
Amean  160    24.85  +- 0.56%  (      )     23.85  +- 1.17%  (   4.06%)    21.25  +- 1.12%  (  14.49%)

               5.2.0 3C-turbo                5.2.0 4C-turbo                5.2.0 8C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2     284.08  +- 0.13%  (  25.18%)  283.96  +- 0.51%  (  25.21%)   285.05  +- 0.21%  (  24.92%)
Amean  4     153.18  +- 0.22%  (  23.47%)  154.70  +- 1.64%  (  22.71%)   153.64  +- 0.30%  (  23.24%)
Amean  8      87.06  +- 0.28%  (  18.02%)   86.77  +- 0.46%  (  18.29%)    86.78  +- 0.22%  (  18.28%)
Amean  16     48.03  +- 0.93%  (  15.68%)   47.75  +- 1.99%  (  16.17%)    47.52  +- 1.61%  (  16.57%)
Amean  32     30.23  +- 1.20%  (  13.14%)   30.08  +- 1.67%  (  13.57%)    30.07  +- 1.67%  (  13.60%)
Amean  64     22.59  +- 2.02%  (  13.50%)   22.63  +- 0.81%  (  13.32%)    22.42  +- 0.76%  (  14.12%)
Amean  128    21.37  +- 0.67%  (  13.82%)   21.31  +- 1.15%  (  14.07%)    21.17  +- 1.93%  (  14.63%)
Amean  160    21.68  +- 0.57%  (  12.76%)   21.18  +- 1.74%  (  14.77%)    21.22  +- 1.00%  (  14.61%)

The patch outperforms active intel_pstate (and baseline) by a considerable
margin; the summary table from the previous section says 4C turbo and active
intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
noticeable difference with regard to the value of freq_max.

Machine           : 8x-SKYLAKE-UMA
Benchmark         : gitsource (time to run the git unit test suite)
Varying parameter : none
Unit              : seconds (lower is better)

            5.2.0 vanilla                5.2.0 intel_pstate/hwp        5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean     858.85  +- 1.16%  (      )   791.94  +- 0.21%  (   7.79%)  474.95  (  44.70%)

            5.2.0 3C-turbo               5.2.0 4C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean     475.26  +- 0.20%  (  44.66%)  474.34  +- 0.13%  (  44.77%)

In this test, which is of interest as representing shell-intensive (i.e.
fork-intensive) serialized workloads, invariant schedutil outperforms
intel_pstate/powersave by a whopping 40% margin.

5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
---------------------------------------------

The following table shows average power consumption in watts for each
benchmark. Data comes from turbostat (package average), which in turn is
read from the RAPL interface on CPUs. We know the patch affects CPU
frequencies so it's reasonable to ignore other power consumers (such as
memory or I/O). Also, we don't have a power meter available in the lab, so
RAPL is the best we have.

turbostat sampled average power every 10 seconds for the entire duration of
each benchmark. We took all those values and averaged them (i.e. we don't
have detail at a per-parameter granularity, only on whole benchmarks).

80x-BROADWELL-NUMA (power consumption, watts)
                                                    +--------+
              BASELINE  I_PSTATE     1C       3C    |   4C   |    8C
pgbench-ro     130.01    142.77    131.11   132.45  | 134.65 |  136.84
pgbench-rw      68.30     60.83     71.45    71.70  |  71.65 |   72.54
dbench4         90.25     59.06    101.43    99.89  | 101.10 |  102.94
netperf-udp     65.70     69.81     66.02    68.03  |  68.27 |   68.95
netperf-tcp     88.08     87.96     88.97    88.89  |  88.85 |   88.20
tbench4        142.32    176.73    153.02   163.91  | 165.58 |  176.07
kernbench       92.94    101.95    114.91   115.47  | 115.52 |  115.10
gitsource       40.92     41.87     75.14    75.20  |  75.40 |   75.70
                                                    +--------+

8x-SKYLAKE-UMA (power consumption, watts)
                                                        +--------+
              BASELINE  I_PSTATE/HWP     1C       3C    |   4C   |
pgbench-ro      46.49      46.68       46.56    46.59   |  46.52 |
pgbench-rw      29.34      31.38       30.98    31.00   |  31.00 |
dbench4         27.28      27.37       27.49    27.41   |  27.38 |
netperf-udp     22.33      22.41       22.36    22.35   |  22.36 |
netperf-tcp     27.29      27.29       27.30    27.31   |  27.33 |
tbench4         41.13      45.61       43.10    43.33   |  43.56 |
kernbench       42.56      42.63       43.01    43.01   |  43.01 |
gitsource       13.32      13.69       17.33    17.30   |  17.35 |
                                                        +--------+

48x-HASWELL-NUMA (power consumption, watts)
                                                    +--------+
              BASELINE  I_PSTATE     1C       3C    |   4C   |   12C
pgbench-ro     128.84    136.04    129.87   132.43  | 132.30 |  134.86
pgbench-rw      37.68     37.92     37.17    37.74  |  37.73 |   37.31
dbench4         28.56     28.73     28.60    28.73  |  28.70 |   28.79
netperf-udp     56.70     60.44     56.79    57.42  |  57.54 |   57.52
netperf-tcp     75.49     75.27     75.87    76.02  |  76.01 |   75.95
tbench4        115.44    139.51    119.53   123.07  | 123.97 |  130.22
kernbench       83.23     91.55     95.58    95.69  |  95.72 |   96.04
gitsource       36.79     36.99     39.99    40.34  |  40.35 |   40.23
                                                    +--------+

A lower power consumption isn't necessarily better; it depends on what is
done with that energy. Here are tables with the ratio of performance-per-watt
on each machine and benchmark. Higher is always better; a tilde (~) means a
neutral ratio (i.e. 1.00).
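The next tables combine the scores above with these power figures. A minimal
sketch of the derivation (our helper for illustration; for the
lower-is-better benchmarks, performance is taken as the reciprocal of the
elapsed time):

```c
#include <math.h>
#include <assert.h>

/*
 * Performance-per-watt of a configuration, relative to baseline:
 * (perf_new / watts_new) / (perf_base / watts_base).
 */
static double ppw_ratio(double perf_new, double watts_new,
			double perf_base, double watts_base)
{
	return (perf_new / watts_new) / (perf_base / watts_base);
}
```

So a configuration that is 21% faster while drawing 10% more power scores
1.21/1.10 = 1.10 here, which is why a benchmark can gain performance and
still lose a few points in this metric when the power increase is larger
than the speedup.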
80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better)
                                     +------+
              I_PSTATE   1C     3C   |  4C  |  8C
pgbench-ro      1.04    1.06   0.94  | 1.07 | 1.08
pgbench-rw      1.10    0.97   0.96  | 0.96 | 0.97
dbench4         1.24    0.94   0.95  | 0.94 | 0.92
netperf-udp      ~      1.02   1.02  |  ~   | 1.02
netperf-tcp      ~      1.02    ~    |  ~   | 1.02
tbench4         1.26    1.10   1.06  | 1.12 | 1.26
kernbench       0.98    0.97   0.97  | 0.97 | 0.98
gitsource        ~      1.11   1.11  | 1.11 | 1.13
                                     +------+

8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better)
                                     +------+
          I_PSTATE/HWP   1C     3C   |  4C  |
pgbench-ro       ~       ~      ~    |  ~   |
pgbench-rw      0.95    0.97   0.96  | 0.96 |
dbench4          ~       ~      ~    |  ~   |
netperf-udp      ~       ~      ~    |  ~   |
netperf-tcp      ~       ~      ~    |  ~   |
tbench4         1.17    1.09   1.08  | 1.10 |
kernbench        ~       ~      ~    |  ~   |
gitsource       1.06    1.40   1.40  | 1.40 |
                                     +------+

48x-HASWELL-NUMA (performance-per-watt ratios; higher is better)
                                     +------+
              I_PSTATE   1C     3C   |  4C  |  12C
pgbench-ro      1.09     ~     1.09  | 1.03 | 1.11
pgbench-rw       ~      0.86    ~    |  ~   | 0.86
dbench4          ~      1.02   1.02  | 1.02 |  ~
netperf-udp      ~      0.97   1.03  | 1.02 |  ~
netperf-tcp     0.96     ~      ~    |  ~   |  ~
tbench4         1.24     ~     1.06  | 1.05 | 1.11
kernbench       0.97    0.97   0.98  | 0.97 | 0.96
gitsource       1.03    1.33   1.32  | 1.32 | 1.33
                                     +------+

These results are overall pleasing: in plenty of cases we observe
performance-per-watt improvements. The few regressions (read/write pgbench
and dbench on the Broadwell machine) are of small magnitude. kernbench loses
a few percentage points (it has a 10-15% performance improvement, but
apparently the increase in power consumption is larger than that). tbench4
and gitsource, which benefit the most from the patch, keep a positive score
in this table, which is a welcome surprise; that suggests that in those
particular workloads the non-invariant schedutil (and active intel_pstate,
too) makes some rather suboptimal frequency selections.

+-------------------------------------------------------------------------+
| 6. 
MICROARCH'ES ADDRESSED HERE
+-------------------------------------------------------------------------+

The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and
MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo
frequencies respectively. This excludes the recent Xeon Scalable Performance
processors line (Xeon Gold, Platinum etc) whose MSRs have to be parsed
differently.

Subsequent patches will address:

* Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus
* Xeon Phi (Knights Landing, Knights Mill)
* Atom Silvermont

+-------------------------------------------------------------------------+
| 7. REFERENCES
+-------------------------------------------------------------------------+

Tests have been run with the help of the MMTests performance testing
framework, see github.com/gormanm/mmtests. The configuration file names for
the benchmarks used are:

    db-pgbench-timed-ro-small-xfs
    db-pgbench-timed-rw-small-xfs
    io-dbench4-async-xfs
    network-netperf-unbound
    network-tbench
    scheduler-unbound
    workload-kerndevel-xfs
    workload-shellscripts-xfs
    hpc-nas-c-class-mpi-full-xfs
    hpc-nas-c-class-omp-full

All those benchmarks are generally available on the web:

pgbench: https://www.postgresql.org/docs/10/pgbench.html
netperf: https://hewlettpackard.github.io/netperf/
dbench/tbench: https://dbench.samba.org/
gitsource: git unit test suite, github.com/git/git
NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
 arch/x86/include/asm/topology.h |  29 +++++++
 arch/x86/kernel/smpboot.c       | 180 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c             |   1 +
 kernel/sched/sched.h            |   7 ++
 4 files changed, 216 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4b14d2318251..462edd6aefd5 
100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -193,4 +193,33 @@ static inline void sched_clear_itmt_support(void)
 }
 #endif /* CONFIG_SCHED_MC_PRIO */
 
+#ifdef CONFIG_SMP
+#include <asm/cpufeature.h>
+
+#define arch_scale_freq_tick arch_scale_freq_tick
+#define arch_scale_freq_capacity arch_scale_freq_capacity
+
+DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
+
+static inline long arch_scale_freq_capacity(int cpu)
+{
+	if (static_cpu_has(X86_FEATURE_APERFMPERF))
+		return per_cpu(arch_cpu_freq, cpu);
+
+	return 1024 /* SCHED_CAPACITY_SCALE */;
+}
+
+extern void arch_scale_freq_tick(void);
+extern void x86_arch_scale_freq_tick_enable(void);
+extern void x86_arch_scale_freq_tick_disable(void);
+#else
+static inline void x86_arch_scale_freq_tick_enable(void)
+{
+}
+
+static inline void x86_arch_scale_freq_tick_disable(void)
+{
+}
+#endif
+
 #endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index fdbd47ceb84d..dd6ae8087cc0 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -147,6 +147,8 @@ static inline void smpboot_restore_warm_reset_vector(void)
 	*((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
 }
 
+static void set_cpu_max_freq(void);
+
 /*
  * Report back to the Boot Processor during boot time or to the caller processor
  * during CPU online.
@@ -183,6 +185,8 @@ static void smp_callin(void)
 	 */
 	set_cpu_sibling_map(raw_smp_processor_id());
 
+	set_cpu_max_freq();
+
 	/*
 	 * Get our bogomips.
 	 * Update loops_per_jiffy in cpu_data. Previous call to
@@ -1342,7 +1346,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 	set_sched_topology(x86_topology);
 
 	set_cpu_sibling_map(0);
-
+	set_cpu_max_freq();
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {
@@ -1764,3 +1768,177 @@ void native_play_dead(void)
 }
 
 #endif
+
+/*
+ * APERF/MPERF frequency ratio computation.
+ *
+ * The scheduler wants to do frequency invariant accounting and needs a <1
+ * ratio to account for the 'current' frequency, corresponding to
+ * freq_curr / freq_max.
+ *
+ * Since the frequency freq_curr on x86 is controlled by micro-controller and
+ * our P-state setting is little more than a request/hint, we need to observe
+ * the effective frequency 'BusyMHz', i.e. the average frequency over a time
+ * interval after discarding idle time. This is given by:
+ *
+ *   BusyMHz = delta_APERF / delta_MPERF * freq_base
+ *
+ * where freq_base is the max non-turbo P-state.
+ *
+ * The freq_max term has to be set to a somewhat arbitrary value, because we
+ * can't know which turbo states will be available at a given point in time:
+ * it all depends on the thermal headroom of the entire package. We set it to
+ * the turbo level with 4 cores active.
+ *
+ * Benchmarks show that's a good compromise between the 1C turbo ratio
+ * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
+ * which would ignore the entire turbo range (a conspicuous part, making
+ * freq_curr/freq_max always maxed out).
+ *
+ * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
+ * freq_curr / freq_max eventually grow >1, in which case we clip it to 1.
+ */
+
+static DEFINE_PER_CPU(u64, arch_prev_aperf);
+static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static u64 arch_max_freq = SCHED_CAPACITY_SCALE;
+
+static bool turbo_disabled(void)
+{
+	u64 misc_en;
+	int err;
+
+	err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
+	if (err)
+		return false;
+
+	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
+}
+
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
+
+#define ICPU(model) \
+	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0}
+
+static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
+	ICPU(INTEL_FAM6_XEON_PHI_KNL),
+	ICPU(INTEL_FAM6_XEON_PHI_KNM),
+	{}
+};
+
+static const struct x86_cpu_id has_turbo_ratio_group_limits[] = {
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT),
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT_X),
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS),
+	ICPU(INTEL_FAM6_SKYLAKE_X),
+	{}
+};
+
+static void core_set_cpu_max_freq(void)
+{
+	u64 ratio, turbo_ratio;
+	int err;
+
+	if (smp_processor_id() != 0)
+		return;
+
+	if (turbo_disabled() ||
+	    x86_match_cpu(has_knl_turbo_ratio_limits) ||
+	    x86_match_cpu(has_turbo_ratio_group_limits))
+		return;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio);
+	if (err)
+		return;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio);
+	if (err)
+		return;
+
+	ratio = (ratio >> 8) & 0xFF;              /* max P state ratio */
+	turbo_ratio = (turbo_ratio >> 24) & 0xFF; /* 4C turbo ratio */
+
+	arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio);
+}
+
+static void intel_set_cpu_max_freq(void)
+{
+	/*
+	 * TODO: add support for:
+	 *
+	 * - Xeon Phi (KNM, KNL)
+	 * - Xeon Gold/Platinum, Atom Goldmont/Goldmont Plus
+	 * - Atom Silvermont
+	 *
+	 * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE
+	 */
+	core_set_cpu_max_freq();
+}
+
+static void set_cpu_max_freq(void)
+{
+	u64 aperf, mperf;
+
+	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+		return;
+
+	switch (boot_cpu_data.x86_vendor) {
+	case X86_VENDOR_INTEL:
+		intel_set_cpu_max_freq();
+		break;
+	default:
break; + } + + rdmsrl(MSR_IA32_APERF, aperf); + rdmsrl(MSR_IA32_MPERF, mperf); + + this_cpu_write(arch_prev_aperf, aperf); + this_cpu_write(arch_prev_mperf, mperf); +} + +DEFINE_PER_CPU(unsigned long, arch_cpu_freq); + +static bool tick_disable; + +void arch_scale_freq_tick(void) +{ + u64 freq; + u64 aperf, mperf; + u64 acnt, mcnt; + + if (!static_cpu_has(X86_FEATURE_APERFMPERF) || tick_disable) + return; + + rdmsrl(MSR_IA32_APERF, aperf); + rdmsrl(MSR_IA32_MPERF, mperf); + + acnt = aperf - this_cpu_read(arch_prev_aperf); + mcnt = mperf - this_cpu_read(arch_prev_mperf); + if (!mcnt) + return; + + this_cpu_write(arch_prev_aperf, aperf); + this_cpu_write(arch_prev_mperf, mperf); + + acnt <<= 2*SCHED_CAPACITY_SHIFT; + mcnt *= arch_max_freq; + + freq = div64_u64(acnt, mcnt); + + if (freq > SCHED_CAPACITY_SCALE) + freq = SCHED_CAPACITY_SCALE; + + this_cpu_write(arch_cpu_freq, freq); +} + +void x86_arch_scale_freq_tick_enable(void) +{ + tick_disable = false; +} + +void x86_arch_scale_freq_tick_disable(void) +{ + tick_disable = true; +} diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 010d578118d6..9fb4af689dfd 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3463,6 +3463,7 @@ void scheduler_tick(void) struct task_struct *curr = rq->curr; struct rq_flags rf; + arch_scale_freq_tick(); sched_clock_tick(); rq_lock(rq, &rf); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 802b1f3405f2..0b724c06b4d9 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1934,6 +1934,13 @@ static inline int hrtick_enabled(struct rq *rq) #endif /* CONFIG_SCHED_HRTICK */ +#ifndef arch_scale_freq_tick +static __always_inline +void arch_scale_freq_tick(void) +{ +} +#endif + #ifndef arch_scale_freq_capacity static __always_inline unsigned long arch_scale_freq_capacity(int cpu) -- 2.16.4 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* RE: [PATCH 1/2] x86,sched: Add support for frequency invariance
  2019-09-09 2:42 ` [PATCH 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich
@ 2019-09-11 15:28 ` Doug Smythies
  2019-09-13 20:58 ` Doug Smythies
  2019-09-13 22:52 ` Srinivas Pandruvada
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: Doug Smythies @ 2019-09-11 15:28 UTC (permalink / raw)
To: 'Giovanni Gherdovich'
Cc: x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, qperret, dietmar.eggemann, srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, Doug Smythies

Hi Giovanni,

Thank you for the great detail and test results you provided.

On 2019.09.08.07:42 Giovanni Gherdovich wrote:

... [snip]...

> The test we call "gitsource" (running the git unit test suite, a long-running
> single-threaded shell script) appears rather spectacular in this table (gains
> of 30-50% depending on the machine). It is to be noted, however, that
> gitsource has no adjustable parameters (such as the number of jobs in
> kernbench, which we average over in order to get a single-number summary
> score) and is exactly the kind of low-parallelism workload that benefits the
> most from this patch. When looking at the detailed tables of kernbench or
> tbench4, at low process or client counts one can see similar numbers.

I think the "gitsource" test is the one I learned about here two years ago [1]. It is an extremely good (the best I know of) example of single-threaded, high-PID-consumption (about 400/second average on my computer [3]) performance issues on a multi-CPU computer. I.e., this:

Dountil the list of tasks is finished:
    Start the next task in the list of stuff to do.
    Wait for it to finish
Enduntil

The problem with the test is its run-to-run variability, which came from all the disk I/O, as far as I could determine. At the time, I studied this to death [2], and made a more repeatable test, without any disk I/O.
While the challenges with this workflow have tended to be focused on the CPU frequency scaling driver, I have always considered the root issue here to be a scheduling issue. Excerpt from my notes [2]:

> The issue is that performance is much much better if the system is
> forced to use only 1 CPU rather than relying on the defaults where
> the CPU scheduler decides what to do.
> The scheduler seems to not realize that the current CPU has just
> become free, and assigns the new task to a new CPU. Thus the load
> on any one CPU is so low that it doesn't ramp up the CPU frequency.
> It would be better if somehow the scheduler knew that the current
> active CPU was now able to take on the new task, overall resulting
> in one fully loaded CPU at the highest CPU frequency.

I do not know if such is practical, and I didn't re-visit the issue.

Anyway these are my results:

Kernel: 5.3-rc8, and 5.3-rc8 + these patches
Processor: i7-2600K

This is important, at least for the performance governor numbers:

cpu6: MSR_TURBO_RATIO_LIMIT: 0x23242526
35 * 100.0 = 3500.0 MHz max turbo 4 active cores
36 * 100.0 = 3600.0 MHz max turbo 3 active cores
37 * 100.0 = 3700.0 MHz max turbo 2 active cores
38 * 100.0 = 3800.0 MHz max turbo 1 active cores

The reference against which all other results are compared is the forced-CPU-affinity test run, i.e.:

taskset -c 3 test_script.

Mode       Governor      degradation  Power      Bzy_MHz
Reference  perf 1 CPU    1.00         reference  3798
-          performance   1.2          6% worse   3618
passive    ondemand      2.3
active     powersave     2.6
passive    schedutil     2.7                     1600
passive    schedutil-4C  1.68                    2515

Where degradation ratio is the time to execute / the reference time for the same conditions. The test runs over a wide range of processes per second, and the worst ratio has been selected for the above table. I have yet to write up this experiment, but the graphs that will eventually be used are at [4] and [5] (same data presented two different ways).
The energy for the performance cases is worth more detail, as it is being wasted with CPUs waking up and going to sleep, and can be observed in the IRQ column of turbostat output:

$ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 60
Busy%  Bzy_MHz  IRQ     PkgTmp  PkgWatt  GFXWatt
12.52  3798     81407   49      22.17    0.12    <<< Forced to CPU 3 only
12.52  3798     81139   51      22.18    0.12
12.52  3798     81036   51      22.20    0.12
11.43  3704     267644  48      21.16    0.12    <<< Change over
12.56  3618     490994  48      23.43    0.12    <<< Let the scheduler decide
12.56  3620     491336  47      23.50    0.12
12.56  3619     491607  47      23.50    0.12
12.56  3619     491512  48      23.52    0.12
12.56  3619     490806  47      23.51    0.12
12.56  3618     491356  49      23.48    0.12
12.56  3618     491035  48      23.51    0.12
12.56  3618     491121  48      23.46    0.12

Note also the busy megahertz column, where other active cores (constantly waking and sleeping as we rotate through which CPUs are used) are limiting the highest frequency.

... Doug

[1] https://marc.info/?l=linux-kernel&m=149181369622980&w=2
[2] http://www.smythies.com/~doug/linux/single-threaded/index.html
[3] http://www.smythies.com/~doug/linux/single-threaded/pids_per_second2.png
[4] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps.png
[5] http://www.smythies.com/~doug/linux/single-threaded/gg-loops.png

^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-11 15:28 ` Doug Smythies @ 2019-09-13 20:58 ` Doug Smythies 2019-09-17 14:25 ` Giovanni Gherdovich 0 siblings, 1 reply; 27+ messages in thread From: Doug Smythies @ 2019-09-13 20:58 UTC (permalink / raw) To: 'Giovanni Gherdovich' Cc: x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, qperret, dietmar.eggemann, srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw On 2019.09.11 08:28 Doug Smythies wrote: > Hi Giovanni, > > Thank you for the great detail and test results you provided. > > On 2019.09.08.07:42 Giovanni Gherdovich wrote: > > ... [snip]... > >> The test we call "gitsource" (running the git unit test suite, a long-running >> single-threaded shell script) appears rather spectacular in this table (gains >> of 30-50% depending on the machine). It is to be noted, however, that >> gitsource has no adjustable parameters (such as the number of jobs in >> kernbench, which we average over in order to get a single-number summary >> score) and is exactly the kind of low-parallelism workload that benefits the >> most from this patch. When looking at the detailed tables of kernbench or >> tbench4, at low process or client counts one can see similar numbers. > > I think the "gitsource" test, is the one I learned about here two years > ago, [1]. It is an extremely good (best I know of) example of single > threaded, high PID consumption (about 400 / second average, my computer > [3]), performance issues on a multi CPU computer. I.E., this: > > Dountil the list of tasks is finished: > Start the next task in the list of stuff to do. > Enduntil > > The problem with the test is its run to run variability, which was from > all the disk I/O, as far as I could determine. I forgot, also some memory caching. I always toss out the first test, then do it 5 more times. If I do not do much stuff with my hard disk in between tests, it is repeatable enough. 
I did the "make test" method and, presenting the numbers your way, got that 4C took 0.69 times as long as the unpatched schedutil. Your numbers were same or better (copied below, lower is better): 80x-BROADWELL-NUMA: 0.49 8x-SKYLAKE-UMA: 0.55 48x-HASWELL-NUMA: 0.69 > At the time, > I studied this to death [2], and made a more repeatable test, without > any disk I/O. > > While the challenges with this work flow have tended to be focused > on the CPU frequency scaling driver, I have always considered > the root issue here to be a scheduling issue. Excerpt from my notes > [2]: > >> The issue is that performance is much much better if the system is >> forced to use only 1 CPU rather than relying on the defaults where >> the CPU scheduler decides what to do. >> The scheduler seems to not realize that the current CPU has just >> become free, and assigns the new task to a new CPU. Thus the load >> on any one CPU is so low that it doesn't ramp up the CPU frequency. >> It would be better if somehow the scheduler knew that the current >> active CPU was now able to take on the new task, overall resulting >> on one fully loaded CPU at the highest CPU frequency. > > I do not know if such is practical, and I didn't re-visit the issue. > > Anyway these are my results: > > Kernel: 5.3-rc8 and + these patches > Processor: i7-2600K > > This is important, at least for the performance governor numbers: > > cpu6: MSR_TURBO_RATIO_LIMIT: 0x23242526 > 35 * 100.0 = 3500.0 MHz max turbo 4 active cores > 36 * 100.0 = 3600.0 MHz max turbo 3 active cores > 37 * 100.0 = 3700.0 MHz max turbo 2 active cores > 38 * 100.0 = 3800.0 MHz max turbo 1 active cores > > For reference against which all other results are compared > is the forced CPU affinity test run. i.e.: > > taskset -c 3 test_script. 
>
> Mode       Governor      degradation  Power      Bzy_MHz
> Reference  perf 1 CPU    1.00         reference  3798
> -          performance   1.2          6% worse   3618
> passive    ondemand      2.3
> active     powersave     2.6
> passive    schedutil     2.7                     1600
> passive    schedutil-4C  1.68                    2515
>
> Where degradation ratio is the time to execute / the reference time for
> the same conditions. The test runs over a wide range of processes per
> second, and the worst ratio has been selected for the above table.
> I have yet to write up this experiment, but the graphs that will
> eventually be used are at [4] and [5] (same data presented two
> different ways).

The experiment write-up is at [6]; however, I wanted more data from the lower tasks-per-second region, and so I re-did it [7]. In the limit as sequential tasks per second goes to 0, the differences should diminish, and I wanted to clearly observe this. Excerpt:

> Conclusion: the schedutil governor improves from the worst
> governor to (mostly) second only to the performance governor
> for unforced CPU affinity execution.
> The energy for the performance cases is worth more detail, as it > is being wasted with CPUs waking up and going to sleep, and can be > observed in the IRQ column of turbostat output: > > $ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 60 > Busy% Bzy_MHz IRQ PkgTmp PkgWatt GFXWatt > 12.52 3798 81407 49 22.17 0.12 <<< Forced to CPU 3 only > 12.52 3798 81139 51 22.18 0.12 > 12.52 3798 81036 51 22.20 0.12 > 11.43 3704 267644 48 21.16 0.12 <<< Change over > 12.56 3618 490994 48 23.43 0.12 <<< Let the scheduler decide > 12.56 3620 491336 47 23.50 0.12 > 12.56 3619 491607 47 23.50 0.12 > 12.56 3619 491512 48 23.52 0.12 > 12.56 3619 490806 47 23.51 0.12 > 12.56 3618 491356 49 23.48 0.12 > 12.56 3618 491035 48 23.51 0.12 > 12.56 3618 491121 48 23.46 0.12 > > Note also the busy megahertz column, where other active cores > (constantly waking and sleeping as we rotate through which > CPUs are used) are limiting the highest frequency. I looked at the power and idle statistics for this forced verses unforced CPU affinity scenario in more detail, [8]. (which isn't really part of this patch sets concern.) Just an additional note: >> +-------------------------------------------------------------------------+ >> | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR >> +-------------------------------------------------------------------------+ >> >> Once an architecture implements a frequency scale-invariant utilization (the >> PELT signal util_avg), schedutil switches its frequency selection formula from >> >> freq_next = 1.25 * freq_curr * util [non-invariant util signal] >> >> to >> >> freq_next = 1.25 * freq_max * util [invariant util signal] >> >> where, in the second formula, freq_max is set to the 1C turbo frequency (max >> turbo). 
The advantage of the second formula, whose usage we unlock with this >> patch, is that freq_next doesn't depend on the current frequency in an >> iterative fashion, but can jump to any frequency in a single update. This >> absence of feedback in the formula makes it quicker to react to utilization >> changes and more robust against pathological instabilities. >> >> Compare it to the update formula of intel_pstate/powersave: >> >> freq_next = 1.25 * freq_max * Busy% >> >> where again freq_max is 1C turbo and Busy% is the percentage of time not spent >> idling (calculated with delta_MPERF / delta_TSC); Note that the delta_MPERF / delta_TSC method includes idle state 0 and the old method of utilization does not (at least not last time I investigated, which was awhile ago (and I can not find my notes)). ... Doug > [1] https://marc.info/?l=linux-kernel&m=149181369622980&w=2 > [2] http://www.smythies.com/~doug/linux/single-threaded/index.html > [3] http://www.smythies.com/~doug/linux/single-threaded/pids_per_second2.png > [4] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps.png > [5] http://www.smythies.com/~doug/linux/single-threaded/gg-loops.png [6] http://www.smythies.com/~doug/linux/single-threaded/k53rc8gg.html [7] http://www.smythies.com/~doug/linux/single-threaded/k53rc8gg2.html [8] http://www.smythies.com/~doug/linux/single-threaded/idle01/index.html ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-13 20:58 ` Doug Smythies @ 2019-09-17 14:25 ` Giovanni Gherdovich 2019-09-19 14:42 ` Doug Smythies 0 siblings, 1 reply; 27+ messages in thread From: Giovanni Gherdovich @ 2019-09-17 14:25 UTC (permalink / raw) To: Doug Smythies Cc: x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, qperret, dietmar.eggemann, srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw Hello Doug, thanks for testing as usual, having some review on the experimental results is really helpful. Sorry for the late reply as I'm traveling at the moment. You raise really good points regarding pinning workloads on cpus, my comments below. On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote: > [...] > > I think the "gitsource" test, is the one I learned about here two years > ago, [1]. It is an extremely good (best I know of) example of single > threaded, high PID consumption (about 400 / second average, my computer > [3]), performance issues on a multi CPU computer. I.E., this: > > Dountil the list of tasks is finished: > Start the next task in the list of stuff to do. > Wait for it to finish > Enduntil > yes that's the one. > The problem with the test is its run to run variability, which was from > all the disk I/O, as far as I could determine. At the time, > I studied this to death [2], and made a more repeatable test, without > any disk I/O. > > While the challenges with this work flow have tended to be focused > on the CPU frequency scaling driver, I have always considered > the root issue here to be a scheduling issue. Excerpt from my notes > [2]: > > > The issue is that performance is much much better if the system is > > forced to use only 1 CPU rather than relying on the defaults where > > the CPU scheduler decides what to do. > > The scheduler seems to not realize that the current CPU has just > > become free, and assigns the new task to a new CPU. 
Thus the load
> > on any one CPU is so low that it doesn't ramp up the CPU frequency.
> > It would be better if somehow the scheduler knew that the current
> > active CPU was now able to take on the new task, overall resulting
> > in one fully loaded CPU at the highest CPU frequency.
>
> I do not know if such is practical, and I didn't re-visit the issue.
>

You're absolutely right: pinning a serialized, fork-intensive workload such as gitsource gives you as good a performance as you can get, because it takes the scheduler out of the picture.

So one might be tempted to flag this test as non-representative of a real-world scenario; the reasons we keep looking at it are:

1. pinning may not always be practical, as you mention
2. it's an adversary, worst-case sort of test for some scheduler code paths

Experience with enterprise use cases shows that pinning (as with 'taskset') is done on a case-by-case basis: it requires a little more cognitive load (you have to know the workload in depth, profile it, write ad-hoc scripts to do the pinning, or modify the code of your software, etc). In the case of "personal computing" one hardly bothers about pinning at all. You definitely want to try those things for the software that runs the core of your business (say, sometimes the database server), but there is a ton of ancillary infrastructure out there which is implemented in shell scripts because it does the job just fine, and it doesn't hurt if that goes a little faster.

The unbound workload (no CPU pinning) will always perform worse than the bound scenario, simply because the scheduler can't know the future; the bound case is a good upper limit to keep in mind when evaluating these results.
When a task is freshly forked the schedutil governor can evaluate its compute need only by the initialization value of the 'util' PELT signal (see "Per-entity load tracking" at [LWN-1]); when a task is migrated from a CPU to another its utilization score is transferred accordingly, so the accrued amount isn't lost (see again PELT and also "Toward better CPU load estimation" at [LWN-2]). These are active development areas in the scheduler, and gitsource (as well as other tests) give an idea of the progress done so far. [LWN-1] Per-entity load tracking, https://lwn.net/Articles/531853/ [LWN-2] Toward better CPU load estimation, https://lwn.net/Articles/741171/ > Anyway these are my results: > > Kernel: 5.3-rc8 and + these patches > Processor: i7-2600K > > This is important, at least for the performance governor numbers: > > cpu6: MSR_TURBO_RATIO_LIMIT: 0x23242526 > 35 * 100.0 = 3500.0 MHz max turbo 4 active cores > 36 * 100.0 = 3600.0 MHz max turbo 3 active cores > 37 * 100.0 = 3700.0 MHz max turbo 2 active cores > 38 * 100.0 = 3800.0 MHz max turbo 1 active cores > > For reference against which all other results are compared > is the forced CPU affinity test run. i.e.: > > taskset -c 3 test_script. > > Mode Governor degradation Power Bzy_MHz > Reference perf 1 CPU 1.00 reference 3798 > - performance 1.2 6% worse 3618 > passive ondemand 2.3 > active powersave 2.6 > passive schedutil 2.7 1600 > passive schedutil-4C 1.68 2515 > > Where degradation ratio is the time to execute / the reference time for > the same conditions. The test runs over a wide range of processes per > second, and the worst ratio has been selected for the above table. > I have yet to write up this experiment, but the graphs that will > eventually be used are at [4] and [5] (same data presented two > different ways). Your table is interesting; I'd say that the one to beat there (from the schedutil point of view) is intel_pstate(active)/performance. 
I'm slightly surprised that intel_pstate(passive)/ondemand is worse than intel_pstate(active)/powersave; I'd have guessed the other way around, but it's also true that the latter lost some grip on iowait_boost in one of the recent dev cycles.

> I did the "make test" method and, presenting the numbers your way,
> got that 4C took 0.69 times as long as the unpatched schedutil.
> Your numbers were same or better (copied below, lower is better):
> 80x-BROADWELL-NUMA: 0.49
> 8x-SKYLAKE-UMA: 0.55
> 48x-HASWELL-NUMA: 0.69

I think your 0.69 and my three values tell the same story: schedutil really needs to use the frequency-invariant formula, otherwise it's out of the race. Enabling scale-invariance gives multiple tens of percentage points of advantage.

Now, is it 0.69 or 0.49? There are many factors to it; that's why I'm happy I can test on multiple machines and get a somewhat more varied picture.

Also, didn't you mention you made several runs and selected the worst one for the final score? I was less adventurous and took the average of 5 runs for my gitsource executions :) that might contribute to a slightly higher final mark.

> > > Compare it to the update formula of intel_pstate/powersave:
> > >
> > > freq_next = 1.25 * freq_max * Busy%
> > >
> > > where again freq_max is 1C turbo and Busy% is the percentage of time not spent
> > > idling (calculated with delta_MPERF / delta_TSC);
>
> Note that the delta_MPERF / delta_TSC method includes idle state 0 and the old
> method of utilization does not (at least not last time I investigated, which was
> awhile ago (and I can not find my notes)).

I think that depends on whether or not the TSC stops at idle. As I understand from the Intel Software Developer Manual (SDM), a TSC that keeps counting at idle is called "invariant TSC", and that is what makes delta_MPERF / delta_TSC interesting. Otherwise the two counters behave exactly the same and the ratio is always 1, modulo the delays in actually reading the two values.
But all I know comes from turbostat's man page and the SDM, so don't quote me on that :) Thanks, Giovanni ^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-17 14:25 ` Giovanni Gherdovich @ 2019-09-19 14:42 ` Doug Smythies 2019-09-24 8:06 ` Mel Gorman 0 siblings, 1 reply; 27+ messages in thread From: Doug Smythies @ 2019-09-19 14:42 UTC (permalink / raw) To: 'Giovanni Gherdovich' Cc: x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, qperret, dietmar.eggemann, srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw Hi Giovanni, Thank you for your detailed reply. On 2019.09.17 07:25 Giovanni Gherdovich wrote: >On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote: > [...] >> The problem with the test is its run to run variability, which was from >> all the disk I/O, as far as I could determine. At the time, >> I studied this to death [2], and made a more repeatable test, without >> any disk I/O. >> >> While the challenges with this work flow have tended to be focused >> on the CPU frequency scaling driver, I have always considered >> the root issue here to be a scheduling issue. Excerpt from my notes >> [2]: >> >>> The issue is that performance is much much better if the system is >>> forced to use only 1 CPU rather than relying on the defaults where >>> the CPU scheduler decides what to do. >>> The scheduler seems to not realize that the current CPU has just >>> become free, and assigns the new task to a new CPU. Thus the load >>> on any one CPU is so low that it doesn't ramp up the CPU frequency. >>> It would be better if somehow the scheduler knew that the current >>> active CPU was now able to take on the new task, overall resulting >>> on one fully loaded CPU at the highest CPU frequency. >> >> I do not know if such is practical, and I didn't re-visit the issue. >> > > You're absolutely right, pinning a serialized, fork-intensive workload such as > gitsource gives you as good of a performance as you can get, because it removes > the scheduler out of the picture. 
> > So one might be tempted to flag this test as non-representative of a > real-world scenario; Disagree. I consider this test to be very representative of real-world scenarios. However, and I do not know for certain, the relatively high average fork rate of the gitsource "make test" is less common. > the reasons we keep looking at it are: > 1. pinning may not always practical, as you mention > 2. it's an adversary, worst-case sort of test for some scheduler code paths Agree. >> For reference against which all other results are compared >> is the forced CPU affinity test run. i.e.: >> >> taskset -c 3 test_script. >> >> Mode Governor degradation Power Bzy_MHz >> Reference perf 1 CPU 1.00 reference 3798 >> - performance 1.2 6% worse 3618 >> passive ondemand 2.3 >> active powersave 2.6 >> passive schedutil 2.7 1600 >> passive schedutil-4C 1.68 2515 >> >> Where degradation ratio is the time to execute / the reference time for >> the same conditions. The test runs over a wide range of processes per >> second, and the worst ratio has been selected for the above table. >> I have yet to write up this experiment, but the graphs that will >> eventually be used are at [4] and [5] (same data presented two >> different ways). > > Your table is interesting; I'd say that the one to beat there (from the > schedutil point of view) is intel_pstate(active)/performance. I'm slightly > surprised that intel_pstate(passive)/ondemand is worse than > intel_pstate(active)/powersave, I'd have guessed the other way around but it's > also true that the latter lost some grip on iowait_boost in of the recent > dev cycles. ?? intel_pstate(passive)/ondemand is better than intel_pstate(active)/powersave, not worse, over the entire range of PIDs (forks) per second and by quite a lot. >> I did the "make test" method and, presenting the numbers your way, >> got that 4C took 0.69 times as long as the unpatched schedutil. 
>> Your numbers were same or better (copied below, lower is better):
>> 80x-BROADWELL-NUMA: 0.49
>> 8x-SKYLAKE-UMA: 0.55
>> 48x-HASWELL-NUMA: 0.69

> I think your 0.69 and my three values tell the same story: schedutil really
> needs to use the frequency invariant formula otherwise it's out of the
> race. Enabling scale-invariance gives multiple tens of percentage points of
> advantage.

Agreed. This frequency-invariant addition is great.

However, if schedutil is "out of the race" without it, as you say, then isn't intel_pstate(passive)/ondemand out of the race also? It performs just as poorly for this test, until very low PIDs per second.

> Now, is it 0.69 or 0.49? There are many factors to it; that's why I'm happy I
> can test on multiple machines and get a somewhat more varied picture.
>
> Also, didn't you mention you made several runs and selected the worst one for
> the final score? I was less adventurous and took the average of 5 runs for my
> gitsource executions :) that might contribute to a slightly higher final mark.

No, I did the exact same as you for the gitsource "make test" method, except that I do 6 runs, throw out the first one, and average the next 5. Yes, I said I picked the worst ratio, but that was for my version of this test, with the disk I/O and its related non-repeatability eliminated, only to provide something for readers that did not want to go to my web site to look at the related graph [1]. I'll send you the graph in a separate e-mail, in case you didn't go to the web site.
> I think that depends on whether or not the TSC stops at idle. As I understand
> from the Intel Software Developer's Manual (SDM), a TSC that stops at idle is
> called "invariant TSC", and that makes delta_MPERF / delta_TSC interesting.
> Otherwise the two counters behave exactly the same and the ratio is always 1,
> modulo the delays in actually reading the two values. But all I know comes
> from turbostat's man page and the SDM, so don't quote me on that :)

I was only talking about idle state 0 (polling), where the TSC does not stop.

By the way, I have now done some tests with this patch set and multi-threaded
stuff. Nothing to report; it all looks great.

[1] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps2.png

... Doug

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-19 14:42 ` Doug Smythies @ 2019-09-24 8:06 ` Mel Gorman 2019-09-24 17:52 ` Doug Smythies 0 siblings, 1 reply; 27+ messages in thread From: Mel Gorman @ 2019-09-24 8:06 UTC (permalink / raw) To: Doug Smythies Cc: 'Giovanni Gherdovich', x86, linux-pm, linux-kernel, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, qperret, dietmar.eggemann, srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw

On Thu, Sep 19, 2019 at 07:42:29AM -0700, Doug Smythies wrote:
> On 2019.09.17 07:25 Giovanni Gherdovich wrote:
> > On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote:
> > [...]
> >> The problem with the test is its run-to-run variability, which was from
> >> all the disk I/O, as far as I could determine. At the time,
> >> I studied this to death [2], and made a more repeatable test, without
> >> any disk I/O.
> >>
> >> While the challenges with this workflow have tended to be focused
> >> on the CPU frequency scaling driver, I have always considered
> >> the root issue here to be a scheduling issue. Excerpt from my notes
> >> [2]:
> >>
> >>> The issue is that performance is much, much better if the system is
> >>> forced to use only 1 CPU rather than relying on the defaults where
> >>> the CPU scheduler decides what to do.
> >>> The scheduler seems to not realize that the current CPU has just
> >>> become free, and assigns the new task to a new CPU. Thus the load
> >>> on any one CPU is so low that it doesn't ramp up the CPU frequency.
> >>> It would be better if somehow the scheduler knew that the currently
> >>> active CPU was now able to take on the new task, overall resulting
> >>> in one fully loaded CPU at the highest CPU frequency.
> >>
> >> I do not know if such a thing is practical, and I didn't re-visit the issue.
> >>
>
> > You're absolutely right, pinning a serialized, fork-intensive workload such as
> > gitsource gives you as good a performance as you can get, because it takes
> > the scheduler out of the picture.
> >
> > So one might be tempted to flag this test as non-representative of a
> > real-world scenario;
>
> Disagree. I consider this test to be very representative of real-world
> scenarios. However (and I do not know this for certain), the relatively high
> average fork rate of the gitsource "make test" is less common.
>

I think it's common enough to be interesting. What I would be very cautious
of is considering this patch in the context of the scheduler decisions made
for synchronous tasks. By synchronous, I mean any waker/wakee pattern where
the waker always goes immediately to sleep. In that case, it is best for the
wakee to use the same CPU as the waker. Unfortunately, the kernel has tried
numerous times to accurately detect when a waker will immediately go to sleep
and it has never worked out properly. When the sync wakeup hint was strictly
obeyed, there were too many cases where the waker did not immediately sleep
and there was a latency hit for the wakee when nearby cores were idle.

"perf sched pipe" is an excellent example of a case where stacking the wakee
on the same CPU as the waker performs excellently, but there are too many
other realistic workloads where it is a sub-optimal decision, such as a waker
waking multiple wakees before it goes to sleep, meaning stacking should
definitely not happen.

Hence, I think this patchset should be considered on its own merits. There
will always be some guesswork when deciding what factor to use to account for
turbo, but the patch is still better than allowing the estimated utilisation
to vary depending on the CPU frequency.
I think the patch is fine and should be merged, with the main caveat being
that some CPU families may need to use a different calculation to account for
turbo boost, which is a per-arch and per-cpu-family decision.

What, if anything, should change in this patchset before it can be merged?
Even if there is follow-on work necessary, it still looks like a reasonable
starting point to me. If the waker/wakee stacking problem was revisited, it
would still be orthogonal to this patch and they would not be in conflict.

> > I think your 0.69 and my three values tell the same story: schedutil really
> > needs to use the frequency-invariant formula, otherwise it's out of the
> > race. Enabling scale invariance gives an advantage of multiple tens of
> > percentage points.
>
> Agreed. This frequency-invariant addition is great. However, if
> schedutil is "out of the race" without it, as you say, then isn't
> intel_pstate(passive)/ondemand out of the race also? It performs
> just as poorly for this test, until very low PIDs per second.
>

In the intel_pstate case, there have been hacks carried out of tree trying
to avoid some of the downsides of it. It also had things like IO wait
boosting in mainline, which was partially to handle the case where history
was lost and in some cases to avoid problems when the wakeup on IO completion
moved a task to another CPU. I think it's a fair assessment to say that
schedutil suffers if frequency invariance is not used, regardless of what
the other cpufreq drivers do.

> >>>> Compare it to the update formula of intel_pstate/powersave:
> >>>
> >>>    freq_next = 1.25 * freq_max * Busy%
> >>>
> >>> where again freq_max is 1C turbo and Busy% is the percentage of time not spent
> >>> idling (calculated with delta_MPERF / delta_TSC);
> >>
> >> Note that the delta_MPERF / delta_TSC method includes idle state 0 and the old
> >> method of utilization does not (at least not the last time I investigated,
> >> which was a while ago, and I cannot find my notes).
> > I think that depends on whether or not the TSC stops at idle. As I understand
> > from the Intel Software Developer's Manual (SDM), a TSC that stops at idle is
> > called "invariant TSC", and that makes delta_MPERF / delta_TSC interesting.
> > Otherwise the two counters behave exactly the same and the ratio is always 1,
> > modulo the delays in actually reading the two values. But all I know comes
> > from turbostat's man page and the SDM, so don't quote me on that :)
>
> I was only talking about idle state 0 (polling), where the TSC does not stop.
>
> By the way, I have now done some tests with this patch set and multi-threaded
> stuff. Nothing to report; it all looks great.
>
> [1] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps2.png
>

Is that an acked-by?

Thanks.

--
Mel Gorman
SUSE Labs

^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-24 8:06 ` Mel Gorman @ 2019-09-24 17:52 ` Doug Smythies 0 siblings, 0 replies; 27+ messages in thread From: Doug Smythies @ 2019-09-24 17:52 UTC (permalink / raw) To: 'Mel Gorman' Cc: 'Giovanni Gherdovich', x86, linux-pm, linux-kernel, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, qperret, dietmar.eggemann, srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw On 2019.09.24 01:06 Mel Gorman wrote: > On Thu, Sep 19, 2019 at 07:42:29AM -0700, Doug Smythies wrote: >> On 2019.09.17 07:25 Giovanni Gherdovich wrote: >>>On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote: >>> [...] > > Hence, I think this patchset should be considered on its own merits. Agree. > I think the patch is fine and should be merged with the main caveat being > that some CPU families may need to use a different calculation to account > for turbo boost which is a per-arch and per-cpu-family decision. Agree. > What, if anything, should change in this patchset before it can be merged? Nothing, and apologies for the tangential discussion. > Is that an acked-by? Absolutely, if I am worthy of ack'ing then: Acked-by: Doug Smythies <dsmythies@telus.net> ... Doug ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-09 2:42 ` [PATCH 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich 2019-09-11 15:28 ` Doug Smythies @ 2019-09-13 22:52 ` Srinivas Pandruvada 2019-09-17 14:27 ` Giovanni Gherdovich 2019-09-14 10:57 ` Quentin Perret ` (2 subsequent siblings) 4 siblings, 1 reply; 27+ messages in thread From: Srinivas Pandruvada @ 2019-09-13 22:52 UTC (permalink / raw) To: Giovanni Gherdovich, tglx, mingo, peterz, bp, lenb, rjw Cc: x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, qperret, dietmar.eggemann

On Mon, 2019-09-09 at 04:42 +0200, Giovanni Gherdovich wrote:

...

> +
> +/*
> + * APERF/MPERF frequency ratio computation.
> + *
> + * The scheduler wants to do frequency invariant accounting and needs a <1
> + * ratio to account for the 'current' frequency, corresponding to
> + * freq_curr / freq_max.

I thought this was no longer a restriction, as Vincent did some work to
remove it.

Thanks,
Srinivas

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-13 22:52 ` Srinivas Pandruvada @ 2019-09-17 14:27 ` Giovanni Gherdovich 2019-09-17 15:55 ` Vincent Guittot 2019-09-19 23:55 ` Srinivas Pandruvada 0 siblings, 2 replies; 27+ messages in thread From: Giovanni Gherdovich @ 2019-09-17 14:27 UTC (permalink / raw) To: Srinivas Pandruvada, tglx, mingo, peterz, bp, lenb, rjw Cc: x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, qperret, dietmar.eggemann

Hello Srinivas,

On Fri, 2019-09-13 at 15:52 -0700, Srinivas Pandruvada wrote:
> On Mon, 2019-09-09 at 04:42 +0200, Giovanni Gherdovich wrote:
>
> ...
>
> > +
> > +/*
> > + * APERF/MPERF frequency ratio computation.
> > + *
> > + * The scheduler wants to do frequency invariant accounting and needs a <1
> > + * ratio to account for the 'current' frequency, corresponding to
> > + * freq_curr / freq_max.
>
> I thought this was no longer a restriction, as Vincent did some work to
> remove it.

If you're referring to the patch

    23127296889f "sched/fair: Update scale invariance of PELT"

merged in v5.2, I'm familiar with that, and from my understanding you still
want a <1 scaling factor. This is my recollection of the patch:

Vincent was studying some synthetic traces and realized that util_avg reported
by PELT didn't quite match the result you'd get computing the formula with pen
and paper (the theoretical value). To address this he changed where the
scaling factor is applied in the PELT formula.

At some point when accumulating the PELT sums, you have to measure the time
'delta' since you last updated PELT.
I don't think Vincent had in mind a faster-than-wall-clock PELT time (which
you'd get with freq_percent > 1).

Speaking of which, Srinivas, do you have any opinion and/or requirement about
this? I vaguely remember Peter Zijlstra saying (more than a year ago, now)
that you would like an unclipped freq_curr/freq_max ratio, and may not be
happy with this patch clipping it to 1 when freq_curr > 4_cores_turbo. If
that's the case, could you elaborate on it?
Ignore that if it doesn't make sense; I may be mis-remembering.

Thanks,
Giovanni

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-17 14:27 ` Giovanni Gherdovich @ 2019-09-17 15:55 ` Vincent Guittot 2019-09-19 23:55 ` Srinivas Pandruvada 1 sibling, 0 replies; 27+ messages in thread From: Vincent Guittot @ 2019-09-17 15:55 UTC (permalink / raw) To: Giovanni Gherdovich Cc: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, bp, Len Brown, Rafael J. Wysocki, x86, open list:THERMAL, linux-kernel, Mel Gorman, Matt Fleming, viresh kumar, Juri Lelli, Paul Turner, qperret, Dietmar Eggemann

On Tue, 17 Sep 2019 at 16:21, Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
>
> Hello Srinivas,
>
> On Fri, 2019-09-13 at 15:52 -0700, Srinivas Pandruvada wrote:
> > On Mon, 2019-09-09 at 04:42 +0200, Giovanni Gherdovich wrote:
> >
> > ...
> >
> > > +
> > > +/*
> > > + * APERF/MPERF frequency ratio computation.
> > > + *
> > > + * The scheduler wants to do frequency invariant accounting and needs a <1
> > > + * ratio to account for the 'current' frequency, corresponding to
> > > + * freq_curr / freq_max.
> >
> > I thought this was no longer a restriction, as Vincent did some work to
> > remove it.
>
> If you're referring to the patch
>
>     23127296889f "sched/fair: Update scale invariance of PELT"
>
> merged in v5.2, I'm familiar with that, and from my understanding you still
> want a <1 scaling factor. This is my recollection of the patch:
>
> Vincent was studying some synthetic traces and realized that util_avg reported
> by PELT didn't quite match the result you'd get computing the formula with pen
> and paper (the theoretical value). To address this he changed where the
> scaling factor is applied in the PELT formula.
>
> At some point when accumulating the PELT sums, you have to measure the time
> 'delta' since you last updated PELT.
> What we have after Vincent's change is
> that this time length 'delta' gets itself scaled by the freq_curr/freq_max
> ratio:
>
>     delta = time since last PELT update
>     delta *= freq_percent
>
> In this way time goes at "wall clock speed" only when you're running at max
> capacity, and goes "slower" (from the PELT point of view) if we're running at
> a lower frequency. I don't think Vincent had in mind a faster-than-wall-clock
> PELT time (which you'd get with freq_percent > 1).

Yes, I hadn't really planned for time going faster than the wall clock, but I
don't see any algorithmic problem if that were the case. There would be a
reduced maximum delta update of the PELT clock, but that will still be large
enough.

> Speaking of which, Srinivas, do you have any opinion and/or requirement about
> this? I vaguely remember Peter Zijlstra saying (more than a year ago, now)
> that you would like an unclipped freq_curr/freq_max ratio, and may not be
> happy with this patch clipping it to 1 when freq_curr > 4_cores_turbo. If
> that's the case, could you elaborate on it?
> Ignore that if it doesn't make sense; I may be mis-remembering.
>
> Thanks,
> Giovanni

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-17 14:27 ` Giovanni Gherdovich 2019-09-17 15:55 ` Vincent Guittot @ 2019-09-19 23:55 ` Srinivas Pandruvada 1 sibling, 0 replies; 27+ messages in thread From: Srinivas Pandruvada @ 2019-09-19 23:55 UTC (permalink / raw) To: Giovanni Gherdovich, tglx, mingo, peterz, bp, lenb, rjw Cc: x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, qperret, dietmar.eggemann

On Tue, 2019-09-17 at 16:27 +0200, Giovanni Gherdovich wrote:
> Hello Srinivas,
>
> On Fri, 2019-09-13 at 15:52 -0700, Srinivas Pandruvada wrote:
> > On Mon, 2019-09-09 at 04:42 +0200, Giovanni Gherdovich wrote:
> >
> > ...
> >
> > > +
> > > +/*
> > > + * APERF/MPERF frequency ratio computation.
> > > + *
> > > + * The scheduler wants to do frequency invariant accounting and needs a <1
> > > + * ratio to account for the 'current' frequency, corresponding to
> > > + * freq_curr / freq_max.
> >
> > I thought this was no longer a restriction, as Vincent did some work to
> > remove it.
>
> If you're referring to the patch
>
>     23127296889f "sched/fair: Update scale invariance of PELT"
>
> merged in v5.2, I'm familiar with that, and from my understanding you still
> want a <1 scaling factor. This is my recollection of the patch:
>
> Vincent was studying some synthetic traces and realized that util_avg reported
> by PELT didn't quite match the result you'd get computing the formula with pen
> and paper (the theoretical value). To address this he changed where the
> scaling factor is applied in the PELT formula.
>
> At some point when accumulating the PELT sums, you have to measure the time
> 'delta' since you last updated PELT.
> What we have after Vincent's change is
> that this time length 'delta' gets itself scaled by the freq_curr/freq_max
> ratio:
>
>     delta = time since last PELT update
>     delta *= freq_percent
>
> In this way time goes at "wall clock speed" only when you're running at max
> capacity, and goes "slower" (from the PELT point of view) if we're running at
> a lower frequency. I don't think Vincent had in mind a faster-than-wall-clock
> PELT time (which you'd get with freq_percent > 1).
>
> Speaking of which, Srinivas, do you have any opinion and/or requirement about
> this? I vaguely remember Peter Zijlstra saying (more than a year ago, now)
> that you would like an unclipped freq_curr/freq_max ratio, and may not be
> happy with this patch clipping it to 1 when freq_curr > 4_cores_turbo. If
> that's the case, could you elaborate on it?
> Ignore that if it doesn't make sense; I may be mis-remembering.

I was thinking of the power efficiency use case: particularly for Atom-like
platforms, 1C max as you observed is more efficient. But now the sched
deadline code is using arch_scale_freq_capacity() to calculate
dl_se->runtime, where a value closer to deterministic across all cores may be
better, which will be scaled with base_freq.

Thanks,
Srinivas

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-09 2:42 ` [PATCH 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich 2019-09-11 15:28 ` Doug Smythies 2019-09-13 22:52 ` Srinivas Pandruvada @ 2019-09-14 10:57 ` Quentin Perret 2019-09-17 14:27 ` Giovanni Gherdovich 2019-09-24 16:04 ` Peter Zijlstra 2019-09-24 16:30 ` Peter Zijlstra 4 siblings, 1 reply; 27+ messages in thread From: Quentin Perret @ 2019-09-14 10:57 UTC (permalink / raw) To: Giovanni Gherdovich Cc: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, dietmar.eggemann

Hi Giovanni,

On Monday 09 Sep 2019 at 04:42:15 (+0200), Giovanni Gherdovich wrote:
> +static inline long arch_scale_freq_capacity(int cpu)
> +{
> +	if (static_cpu_has(X86_FEATURE_APERFMPERF))
> +		return per_cpu(arch_cpu_freq, cpu);

So, if this is conditional, perhaps you could also add this check in an
x86-specific implementation of arch_scale_freq_invariant()? That would
guide sugov down the right path (see get_next_freq()) if APERF/MPERF are
unavailable.

> +	return 1024 /* SCHED_CAPACITY_SCALE */;
> +}

Thanks,
Quentin

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-14 10:57 ` Quentin Perret @ 2019-09-17 14:27 ` Giovanni Gherdovich 2019-09-17 14:39 ` Quentin Perret 2019-09-24 14:03 ` Peter Zijlstra 0 siblings, 2 replies; 27+ messages in thread From: Giovanni Gherdovich @ 2019-09-17 14:27 UTC (permalink / raw) To: Quentin Perret Cc: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, dietmar.eggemann

Hello Quentin,

On Sat, 2019-09-14 at 12:57 +0200, Quentin Perret wrote:
> Hi Giovanni
>
> On Monday 09 Sep 2019 at 04:42:15 (+0200), Giovanni Gherdovich wrote:
> > +static inline long arch_scale_freq_capacity(int cpu)
> > +{
> > +	if (static_cpu_has(X86_FEATURE_APERFMPERF))
> > +		return per_cpu(arch_cpu_freq, cpu);
>
> So, if this is conditional, perhaps you could also add this check in an
> x86-specific implementation of arch_scale_freq_invariant()? That would
> guide sugov down the right path (see get_next_freq()) if APERF/MPERF are
> unavailable.
>
> > +	return 1024 /* SCHED_CAPACITY_SCALE */;
> > +}

Good remark. If the CPU doesn't have APERF/MPERF, the choice here is that
freq_curr is constantly equal to freq_max, and the scaling factor is 1 all
the time.

But I'm checking this static_cpu_has() every time I do a frequency update;
arguably schedutil should be smarter and settle such a case once and for all
at boot time.

I'll check what's the cost of static_cpu_has() and, if it's non-negligible,
I'll do what you suggest (an x86-specific version of
arch_scale_freq_invariant()).

Giovanni

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-17 14:27 ` Giovanni Gherdovich @ 2019-09-17 14:39 ` Quentin Perret 2019-09-24 14:03 ` Peter Zijlstra 0 siblings, 0 replies; 27+ messages in thread From: Quentin Perret @ 2019-09-17 14:39 UTC (permalink / raw) To: Giovanni Gherdovich Cc: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, dietmar.eggemann

On Tuesday 17 Sep 2019 at 16:27:46 (+0200), Giovanni Gherdovich wrote:
> I'll check what's the cost of static_cpu_has() and, if it's non-negligible,
> I'll do what you suggest (an x86-specific version of
> arch_scale_freq_invariant()).

In case this is indeed expensive to check, you could always add a static key,
set at boot time, to optimize things a bit... That might be worth it, since
this is called in latency-sensitive paths of the scheduler.

Thanks,
Quentin

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-17 14:27 ` Giovanni Gherdovich 2019-09-17 14:39 ` Quentin Perret @ 2019-09-24 14:03 ` Peter Zijlstra 2019-09-24 16:00 ` Peter Zijlstra 1 sibling, 1 reply; 27+ messages in thread From: Peter Zijlstra @ 2019-09-24 14:03 UTC (permalink / raw) To: Giovanni Gherdovich Cc: Quentin Perret, srinivas.pandruvada, tglx, mingo, bp, lenb, rjw, x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, dietmar.eggemann

On Tue, Sep 17, 2019 at 04:27:46PM +0200, Giovanni Gherdovich wrote:
> Hello Quentin,
>
> On Sat, 2019-09-14 at 12:57 +0200, Quentin Perret wrote:
> > Hi Giovanni
> >
> > On Monday 09 Sep 2019 at 04:42:15 (+0200), Giovanni Gherdovich wrote:
> > > +static inline long arch_scale_freq_capacity(int cpu)
> > > +{
> > > +	if (static_cpu_has(X86_FEATURE_APERFMPERF))
> > > +		return per_cpu(arch_cpu_freq, cpu);
> >
> > So, if this is conditional, perhaps you could also add this check in an
> > x86-specific implementation of arch_scale_freq_invariant()? That would
> > guide sugov down the right path (see get_next_freq()) if APERF/MPERF are
> > unavailable.
> >
> > > +	return 1024 /* SCHED_CAPACITY_SCALE */;
> > > +}
>
> Good remark. If the CPU doesn't have APERF/MPERF, the choice here is that
> freq_curr is constantly equal to freq_max, and the scaling factor is 1 all
> the time.
>
> But I'm checking this static_cpu_has() every time I do a frequency update;
> arguably schedutil should be smarter and settle such a case once and for all
> at boot time.
>
> I'll check what's the cost of static_cpu_has() and, if it's non-negligible,
> I'll do what you suggest (an x86-specific version of
> arch_scale_freq_invariant()).

static_cpu_has() is an alternative and ends up being a static branch
(similar to a static_key) once the alternative patching runs.

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-24 14:03 ` Peter Zijlstra @ 2019-09-24 16:00 ` Peter Zijlstra 2019-10-02 12:27 ` Giovanni Gherdovich 0 siblings, 1 reply; 27+ messages in thread From: Peter Zijlstra @ 2019-09-24 16:00 UTC (permalink / raw) To: Giovanni Gherdovich Cc: Quentin Perret, srinivas.pandruvada, tglx, mingo, bp, lenb, rjw, x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, dietmar.eggemann

On Tue, Sep 24, 2019 at 04:03:32PM +0200, Peter Zijlstra wrote:
> > I'll check what's the cost of static_cpu_has() and, if it's non-negligible,
> > I'll do what you suggest (an x86-specific version of
> > arch_scale_freq_invariant()).
>
> static_cpu_has() is an alternative and ends up being a static branch
> (similar to a static_key) once the alternative patching runs.

That said, I think you want a static key anyway, because if we can't
tell the max_freq we don't want to use the invariant stuff.

Something a little like so on top, perhaps.

Also, the below fixes that silly tick_disable stuff.
---
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -196,20 +196,24 @@ static inline void sched_clear_itmt_supp
 #ifdef CONFIG_SMP
 #include <asm/cpufeature.h>
 
-#define arch_scale_freq_tick arch_scale_freq_tick
-#define arch_scale_freq_capacity arch_scale_freq_capacity
+DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)
 
 DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
 
 static inline long arch_scale_freq_capacity(int cpu)
 {
-	if (static_cpu_has(X86_FEATURE_APERFMPERF))
+	if (arch_scale_freq_invariant())
 		return per_cpu(arch_cpu_freq, cpu);
 
 	return 1024 /* SCHED_CAPACITY_SCALE */;
 }
+#define arch_scale_freq_capacity arch_scale_freq_capacity
 
 extern void arch_scale_freq_tick(void);
+#define arch_scale_freq_tick arch_scale_freq_tick
+
 extern void x86_arch_scale_freq_tick_enable(void);
 extern void x86_arch_scale_freq_tick_disable(void);
 #else
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1799,6 +1799,8 @@ void native_play_dead(void)
  * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
  */
 
+DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
 static DEFINE_PER_CPU(u64, arch_prev_aperf);
 static DEFINE_PER_CPU(u64, arch_prev_mperf);
 static u64 arch_max_freq = SCHED_CAPACITY_SCALE;
@@ -1860,6 +1862,8 @@ static void core_set_cpu_max_freq(void)
 	turbo_ratio = (turbo_ratio >> 24) & 0xFF;	/* 4C turbo ratio */
 
 	arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio);
+
+	static_key_enable(&arch_scale_freq_key);
 }
 
 static void intel_set_cpu_max_freq(void)
@@ -1876,10 +1880,19 @@ static void intel_set_cpu_max_freq(void)
 	core_set_cpu_max_freq();
 }
 
-static void set_cpu_max_freq(void)
+static void init_scale_freq(void *arg)
 {
 	u64 aperf, mperf;
 
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+}
+
+static void set_cpu_max_freq(void)
+{
 	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
 		return;
 
@@ -1891,11 +1904,7 @@ static void set_cpu_max_freq(void)
 		break;
 	}
 
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-
-	this_cpu_write(arch_prev_aperf, aperf);
-	this_cpu_write(arch_prev_mperf, mperf);
+	init_scale_freq(NULL);
 }
 
 DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
@@ -1908,7 +1917,7 @@ void arch_scale_freq_tick(void)
 	u64 aperf, mperf;
 	u64 acnt, mcnt;
 
-	if (!static_cpu_has(X86_FEATURE_APERFMPERF) || tick_disable)
+	if (!arch_scale_freq_invariant() || tick_disable)
 		return;
 
 	rdmsrl(MSR_IA32_APERF, aperf);
@@ -1940,5 +1949,6 @@ void x86_arch_scale_freq_tick_enable(voi
 
 void x86_arch_scale_freq_tick_disable(void)
 {
+	on_each_cpu(init_scale_freq, NULL, 1);
 	tick_disable = true;
 }

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance 2019-09-24 16:00 ` Peter Zijlstra @ 2019-10-02 12:27 ` Giovanni Gherdovich 2019-10-02 18:45 ` Peter Zijlstra 0 siblings, 1 reply; 27+ messages in thread From: Giovanni Gherdovich @ 2019-10-02 12:27 UTC (permalink / raw) To: Peter Zijlstra Cc: Quentin Perret, srinivas.pandruvada, tglx, mingo, bp, lenb, rjw, x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt, vincent.guittot, dietmar.eggemann

On Tue, 2019-09-24 at 18:00 +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 04:03:32PM +0200, Peter Zijlstra wrote:
> >
> > > I'll check what's the cost of static_cpu_has() and, if it's non-negligible,
> > > I'll do what you suggest (an x86-specific version of
> > > arch_scale_freq_invariant()).
> >
> > static_cpu_has() is an alternative and ends up being a static branch
> > (similar to a static_key) once the alternative patching runs.
>
> That said, I think you want a static key anyway, because if we can't
> tell the max_freq we don't want to use the invariant stuff.
>
> Something a little like so on top, perhaps.
>
> Also, the below fixes that silly tick_disable stuff.

Thanks for this patch, I'll add this change in v2.

Can you elaborate on what you don't like in the tick_disable mechanism?
After reading your comments I realized there is a problem, but I'm not sure
it is the same one you're addressing. More on this below, under your edit of
the function x86_arch_scale_freq_tick_disable().
>
> ---
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -196,20 +196,24 @@ static inline void sched_clear_itmt_supp
>  #ifdef CONFIG_SMP
>  #include <asm/cpufeature.h>
>
> -#define arch_scale_freq_tick arch_scale_freq_tick
> -#define arch_scale_freq_capacity arch_scale_freq_capacity
> +DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
> +
> +#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)

This confused me for a second, but then I realized that this #define comes
before the one in kernel/sched/sched.h, where arch_scale_freq_invariant() is
defined again but guarded against previous definitions, so it all falls into
place; code from schedutil will see this one.

>
>  DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
>
>  static inline long arch_scale_freq_capacity(int cpu)
>  {
> -	if (static_cpu_has(X86_FEATURE_APERFMPERF))
> +	if (arch_scale_freq_invariant())
>  		return per_cpu(arch_cpu_freq, cpu);
>
>  	return 1024 /* SCHED_CAPACITY_SCALE */;
>  }
> +#define arch_scale_freq_capacity arch_scale_freq_capacity
>
>  extern void arch_scale_freq_tick(void);
> +#define arch_scale_freq_tick arch_scale_freq_tick
> +
>  extern void x86_arch_scale_freq_tick_enable(void);
>  extern void x86_arch_scale_freq_tick_disable(void);
>  #else
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1799,6 +1799,8 @@ void native_play_dead(void)
>   * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
>   */
>
> +DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
> +
>  static DEFINE_PER_CPU(u64, arch_prev_aperf);
>  static DEFINE_PER_CPU(u64, arch_prev_mperf);
>  static u64 arch_max_freq = SCHED_CAPACITY_SCALE;
> @@ -1860,6 +1862,8 @@ static void core_set_cpu_max_freq(void)
>  	turbo_ratio = (turbo_ratio >> 24) & 0xFF;	/* 4C turbo ratio */
>
>  	arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio);
> +
> +	static_key_enable(&arch_scale_freq_key);
>  }
>
>  static void intel_set_cpu_max_freq(void)
> @@ -1876,10 +1880,19 @@ static void intel_set_cpu_max_freq(void)
>  	core_set_cpu_max_freq();
>  }
>
> -static void set_cpu_max_freq(void)
> +static void init_scale_freq(void *arg)
>  {
>  	u64 aperf, mperf;
>
> +	rdmsrl(MSR_IA32_APERF, aperf);
> +	rdmsrl(MSR_IA32_MPERF, mperf);
> +
> +	this_cpu_write(arch_prev_aperf, aperf);
> +	this_cpu_write(arch_prev_mperf, mperf);
> +}
> +
> +static void set_cpu_max_freq(void)
> +{
>  	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
>  		return;
>
> @@ -1891,11 +1904,7 @@ static void set_cpu_max_freq(void)
>  		break;
>  	}
>
> -	rdmsrl(MSR_IA32_APERF, aperf);
> -	rdmsrl(MSR_IA32_MPERF, mperf);
> -
> -	this_cpu_write(arch_prev_aperf, aperf);
> -	this_cpu_write(arch_prev_mperf, mperf);
> +	init_scale_freq(NULL);
>  }
>
>  DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
> @@ -1908,7 +1917,7 @@ void arch_scale_freq_tick(void)
>  	u64 aperf, mperf;
>  	u64 acnt, mcnt;
>
> -	if (!static_cpu_has(X86_FEATURE_APERFMPERF) || tick_disable)
> +	if (!arch_scale_freq_invariant() || tick_disable)
>  		return;
>
>  	rdmsrl(MSR_IA32_APERF, aperf);
> @@ -1940,5 +1949,6 @@ void x86_arch_scale_freq_tick_enable(voi
>
>  void x86_arch_scale_freq_tick_disable(void)
>  {
> +	on_each_cpu(init_scale_freq, NULL, 1);
>  	tick_disable = true;

I don't see why the call to init_scale_freq() is needed here; why would I
care what's in arch_prev_[am]perf at this point. arch_scale_freq_tick() will
see that tick_disable == true and exit early, before reading
arch_prev_[am]perf.
The problem IMO emerges in the following configuration, which is a bug in the
patch I sent:

* arch_scale_freq_invariant() is true (because we have APERF/MPERF)
* arch_scale_freq_capacity() is non-trivial (reads arch_cpu_freq)
* tick calculations are disabled

In this case arch_scale_freq_capacity() feeds stale data to the function
update_rq_clock_pelt() in kernel/sched/pelt.h. I initially missed this problem
because I forgot that PELT signals have more users than just the schedutil
governor (load balancer etc). This is exactly the situation produced by patch
2/2, which disables the tick calculations whenever the driver is not
intel_cpufreq (aka intel_pstate=passive).

I think the fix for this is to set arch_cpu_freq (each per-cpu instance of the
variable) to SCHED_CAPACITY_SCALE here in x86_arch_scale_freq_tick_disable().
That would render the scaling factor for invariance moot (always 1), just as
it is w/o scale invariance.

I'm sending v2 with all your amendments except this last one.

Giovanni

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance
  2019-10-02 12:27 ` Giovanni Gherdovich
@ 2019-10-02 18:45   ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2019-10-02 18:45 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Quentin Perret, srinivas.pandruvada, tglx, mingo, bp, lenb, rjw,
      x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar,
      juri.lelli, pjt, vincent.guittot, dietmar.eggemann

On Wed, Oct 02, 2019 at 02:27:54PM +0200, Giovanni Gherdovich wrote:
> On Tue, 2019-09-24 at 18:00 +0200, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 04:03:32PM +0200, Peter Zijlstra wrote:
> > > > I'll check what's the cost of static_cpu_has() and if it's non-negligible I'll
> > > > do what you suggest (x86-specific version of arch_scale_freq_invariant().
> > >
> > > static_cpu_has() is an alternative and ends up being a static branch
> > > (similar to static_key) once the alternative patching runs.
> >
> > That said; I think you want a static key anyway, because if we can't
> > tell the max_freq we don't want to use the invariant stuff.
> >
> > Something a little like so on top perhaps.
> >
> > Also, the below fixes that silly tick_disable stuff.
>
> Thanks for this patch, I'll add this change in v2.
>
> Can you elaborate on what you don't like in the tick_disable mechanism?

Mostly because I dislike intel_pstate active mode a lot, but also because
it makes PELT behave differently between pstate and !pstate.
> > +static void init_scale_freq(void *arg)
> > {
> >  	u64 aperf, mperf;
> >
> > +	rdmsrl(MSR_IA32_APERF, aperf);
> > +	rdmsrl(MSR_IA32_MPERF, mperf);
> > +
> > +	this_cpu_write(arch_prev_aperf, aperf);
> > +	this_cpu_write(arch_prev_mperf, mperf);
> > +}
> > +
> > @@ -1940,5 +1949,6 @@ void x86_arch_scale_freq_tick_enable(voi
> >
> >  void x86_arch_scale_freq_tick_disable(void)
> >  {
> > +	on_each_cpu(init_scale_freq, NULL, 1);
> >  	tick_disable = true;
>
> I don't see why the call init_scale_freq() here is needed; why would I care of
> what's in arch_prev_[am]perf at this point. arch_scale_freq_tick() will see
> that tick_disable == true and exit early before reading arch_prev_[am]perf.

You're right, we should reset the prev values on enable. Otherwise the first
tick after enable will see 'weird' values.

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance
  2019-09-09  2:42 ` [PATCH 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich
                   ` (2 preceding siblings ...)
  2019-09-14 10:57 ` Quentin Perret
@ 2019-09-24 16:04 ` Peter Zijlstra
  2019-10-02 12:26   ` Giovanni Gherdovich
  2019-09-24 16:30 ` Peter Zijlstra
  4 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2019-09-24 16:04 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: srinivas.pandruvada, tglx, mingo, bp, lenb, rjw, x86, linux-pm,
      linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt,
      vincent.guittot, qperret, dietmar.eggemann

On Mon, Sep 09, 2019 at 04:42:15AM +0200, Giovanni Gherdovich wrote:

> +static void intel_set_cpu_max_freq(void)
> +{
> +	/*
> +	 * TODO: add support for:
> +	 *
> +	 * - Xeon Phi (KNM, KNL)
> +	 * - Xeon Gold/Platinum, Atom Goldmont/Goldmont Plus
> +	 * - Atom Silvermont

ISTR I had code for Atom.. what happened with that?

> +	 *
> +	 * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE
> +	 */
> +	core_set_cpu_max_freq();
> +}

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance
  2019-09-24 16:04 ` Peter Zijlstra
@ 2019-10-02 12:26   ` Giovanni Gherdovich
  2019-10-02 18:35     ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Giovanni Gherdovich @ 2019-10-02 12:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: srinivas.pandruvada, tglx, mingo, bp, lenb, rjw, x86, linux-pm,
      linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt,
      vincent.guittot, qperret, dietmar.eggemann

On Tue, 2019-09-24 at 18:04 +0200, Peter Zijlstra wrote:
> On Mon, Sep 09, 2019 at 04:42:15AM +0200, Giovanni Gherdovich wrote:
>
> > +static void intel_set_cpu_max_freq(void)
> > +{
> > +	/*
> > +	 * TODO: add support for:
> > +	 *
> > +	 * - Xeon Phi (KNM, KNL)
> > +	 * - Xeon Gold/Platinum, Atom Goldmont/Goldmont Plus
> > +	 * - Atom Silvermont
>
> ISTR I had code for Atom.. what happened with that?

I'm being overly zealous and I wanted to get a Silvermont machine to test that
code before sending.

The reason is that your code uses MSR_ATOM_CORE_RATIOS and
MSR_ATOM_CORE_TURBO_RATIOS which are not documented in the SDM. I wanted to
make sure those have the expected content on at least one machine before using
them in my code. I have no doubt you, Srinivas and Len (who uses them in
turbostat) have already checked but you know, more eyeballs.

I've talked to Len and Srinivas at LPC, they agreed that those two MSR may not
have made it to the SDM but said the turbostat source code is the reference in
this case.

Giovanni

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance
  2019-10-02 12:26 ` Giovanni Gherdovich
@ 2019-10-02 18:35   ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2019-10-02 18:35 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: srinivas.pandruvada, tglx, mingo, bp, lenb, rjw, x86, linux-pm,
      linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt,
      vincent.guittot, qperret, dietmar.eggemann

On Wed, Oct 02, 2019 at 02:26:44PM +0200, Giovanni Gherdovich wrote:
> On Tue, 2019-09-24 at 18:04 +0200, Peter Zijlstra wrote:
> > On Mon, Sep 09, 2019 at 04:42:15AM +0200, Giovanni Gherdovich wrote:
> >
> > > +static void intel_set_cpu_max_freq(void)
> > > +{
> > > +	/*
> > > +	 * TODO: add support for:
> > > +	 *
> > > +	 * - Xeon Phi (KNM, KNL)
> > > +	 * - Xeon Gold/Platinum, Atom Goldmont/Goldmont Plus
> > > +	 * - Atom Silvermont
> >
> > ISTR I had code for Atom.. what happened with that?
>
> I'm being overly zealous and I wanted to get a Silvermont machine to test that
> code before sending.
>
> The reason is that your code uses MSR_ATOM_CORE_RATIOS and
> MSR_ATOM_CORE_TURBO_RATIOS which are not documented in the SDM. I wanted to
> make sure those have the expected content on at least one machine before using
> them in my code. I have no doubt you, Srinivas and Len (who uses them in
> turbostat) have already checked but you know, more eyeballs.
>
> I've talked to Len and Srinivas at LPC, they agreed that those two MSR may not
> have made it to the SDM but said the turbostat source code is the reference in
> this case.

Can you at least include the patch as RFC then? Perhaps other people, who
have hardware at hand, can then help test it.

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance
  2019-09-09  2:42 ` [PATCH 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich
                   ` (3 preceding siblings ...)
  2019-09-24 16:04 ` Peter Zijlstra
@ 2019-09-24 16:30 ` Peter Zijlstra
  2019-10-02 12:25   ` Giovanni Gherdovich
  4 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2019-09-24 16:30 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: srinivas.pandruvada, tglx, mingo, bp, lenb, rjw, x86, linux-pm,
      linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt,
      vincent.guittot, qperret, dietmar.eggemann

On Mon, Sep 09, 2019 at 04:42:15AM +0200, Giovanni Gherdovich wrote:

> +static const struct x86_cpu_id has_turbo_ratio_group_limits[] = {
> +	ICPU(INTEL_FAM6_ATOM_GOLDMONT),
> +	ICPU(INTEL_FAM6_ATOM_GOLDMONT_X),

That's GOLDMONT_D in recent tip kernels.

> +	ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS),
> +	ICPU(INTEL_FAM6_SKYLAKE_X),

What about KABYLAKE_X and ICELAKE_X ?

> +	{}
> +};

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance
  2019-09-24 16:30 ` Peter Zijlstra
@ 2019-10-02 12:25   ` Giovanni Gherdovich
  2019-10-02 18:47     ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Giovanni Gherdovich @ 2019-10-02 12:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: srinivas.pandruvada, tglx, mingo, bp, lenb, rjw, x86, linux-pm,
      linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt,
      vincent.guittot, qperret, dietmar.eggemann

Hello Peter, late replies as I wasn't in the office last week.

On Tue, 2019-09-24 at 18:30 +0200, Peter Zijlstra wrote:
> On Mon, Sep 09, 2019 at 04:42:15AM +0200, Giovanni Gherdovich wrote:
> > +static const struct x86_cpu_id has_turbo_ratio_group_limits[] = {
> > +	ICPU(INTEL_FAM6_ATOM_GOLDMONT),
> > +	ICPU(INTEL_FAM6_ATOM_GOLDMONT_X),
>
> That's GOLDMONT_D in recent tip kernels.

Right, I saw that now.

> > +	ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS),
> > +	ICPU(INTEL_FAM6_SKYLAKE_X),
>
> What about KABYLAKE_X and ICELAKE_X ?

KABYLAKE_X: does it exist? I couldn't find it in
arch/x86/include/asm/intel-family.h (the tip tree), I only see KABYLAKE_L and
KABYLAKE.

ICELAKE_X: well, I don't know really. Does this model have the same semantic
for MSR_TURBO_RATIO_LIMIT as SKYLAKE_X (which is family = 0x6, model = 0x55)?
This is for Len B. and Srinivas P. (in CC).

The latest Software Developer's Manual (SDM) from May 2019 (volume 4, section
2.17.3, "MSRs Specific to Intel Xeon Processor Scalable Family") mentions only
"CPUID DisplayFamily_DisplayModel = 06_55H", which is SKYLAKE_X, as having the
semantic I'm looking for here (in addition to Atom Goldmont's). The semantic
I'm referring to is that MSR_TURBO_RATIO_LIMIT doesn't contain turbo levels
for the fixed group sizes 1-2-3-4-... cores; instead, the group sizes are
specified in a different MSR (and could be 2-4-8-12-... for example).
If the SDM is outdated and ICELAKE_X is also in that category, then the
turbostat source code is outdated too as it has this function to detect this
feature:

    int has_turbo_ratio_group_limits(int family, int model)
    {
    	if (!genuine_intel)
    		return 0;

    	switch (model) {
    	case INTEL_FAM6_ATOM_GOLDMONT:
    	case INTEL_FAM6_SKYLAKE_X:
    	case INTEL_FAM6_ATOM_GOLDMONT_X:
    		return 1;
    	}
    	return 0;
    }

(from the tree lenb/linux.git, branch "turbostat", turbostat version 19.08.31,
not yet merged into mainline)

Giovanni

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] x86,sched: Add support for frequency invariance
  2019-10-02 12:25 ` Giovanni Gherdovich
@ 2019-10-02 18:47   ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2019-10-02 18:47 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: srinivas.pandruvada, tglx, mingo, bp, lenb, rjw, x86, linux-pm,
      linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt,
      vincent.guittot, qperret, dietmar.eggemann

On Wed, Oct 02, 2019 at 02:25:52PM +0200, Giovanni Gherdovich wrote:
> > What about KABYLAKE_X and ICELAKE_X ?
>
> KABYLAKE_X: does it exist? I couldn't find it in
> arch/x86/include/asm/intel-family.h (the tip tree), I only see KABYLAKE_L and
> KABYLAKE.

My bad, I must've been staring cross-eyed at intel-family.h.

> If the SDM is outdated and ICELAKE_X is also in that category, then the
> turbostat source code is outdated too as it has this function to detect this
> feature:

I think you can trust the turbostat code.

^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2019-09-09  2:42 [PATCH 0/2] Add support for frequency invariance for (some) x86 Giovanni Gherdovich
  2019-09-09  2:42 ` [PATCH 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich
@ 2019-09-09  2:42 ` Giovanni Gherdovich
  2019-09-24 16:01 ` [PATCH 0/2] Add support for frequency invariance for (some) x86 Peter Zijlstra
  2 siblings, 0 replies; 27+ messages in thread
From: Giovanni Gherdovich @ 2019-09-09 2:42 UTC (permalink / raw)
  To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw
  Cc: x86, linux-pm, linux-kernel, mgorman, matt, viresh.kumar,
      juri.lelli, pjt, vincent.guittot, qperret, dietmar.eggemann,
      Giovanni Gherdovich

From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

intel_pstate has two operating modes: active and passive. In "active" mode
the in-built scaling governor is used, while in "passive" mode the driver can
be used with any governor, such as "schedutil". In "active" mode the
utilization values from schedutil are not used, and there is a requirement
from high performance computing use cases not to read any APERF/MPERF MSRs;
in that case there is no need to spend CPU cycles on frequency invariant
accounting by reading the APERF/MPERF MSRs. With this change, frequency
invariant accounting is only enabled in "passive" mode.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
 drivers/cpufreq/intel_pstate.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index cc27d4c59dca..d55da8604d50 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2381,6 +2381,8 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
 {
 	int ret;

+	x86_arch_scale_freq_tick_disable();
+
 	memset(&global, 0, sizeof(global));
 	global.max_perf_pct = 100;
@@ -2393,6 +2395,9 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)

 	global.min_perf_pct = min_perf_pct_min();

+	if (driver == &intel_cpufreq)
+		x86_arch_scale_freq_tick_enable();
+
 	return 0;
 }

--
2.16.4

^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH 0/2] Add support for frequency invariance for (some) x86
  2019-09-09  2:42 [PATCH 0/2] Add support for frequency invariance for (some) x86 Giovanni Gherdovich
  2019-09-09  2:42 ` [PATCH 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich
  2019-09-09  2:42 ` [PATCH 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting Giovanni Gherdovich
@ 2019-09-24 16:01 ` Peter Zijlstra
  2 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2019-09-24 16:01 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: srinivas.pandruvada, tglx, mingo, bp, lenb, rjw, x86, linux-pm,
      linux-kernel, mgorman, matt, viresh.kumar, juri.lelli, pjt,
      vincent.guittot, qperret, dietmar.eggemann

On Mon, Sep 09, 2019 at 04:42:14AM +0200, Giovanni Gherdovich wrote:
> This is a resend with of Peter Zijlstra's patch to support frequency
> scale-invariance on x86 from May 2018 [see 1]. I've added some modifications
> and included performance test results. If Peter doesn't mind, I'm slapping my
> name on it :)

That is fine; all I did was write some code, you did the hard part and
made it 'work' :-)

^ permalink raw reply [flat|nested] 27+ messages in thread
Thread overview: 27+ messages
2019-09-09  2:42 [PATCH 0/2] Add support for frequency invariance for (some) x86 Giovanni Gherdovich
2019-09-09  2:42 ` [PATCH 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich
2019-09-11 15:28   ` Doug Smythies
2019-09-13 20:58     ` Doug Smythies
2019-09-17 14:25       ` Giovanni Gherdovich
2019-09-19 14:42         ` Doug Smythies
2019-09-24  8:06           ` Mel Gorman
2019-09-24 17:52             ` Doug Smythies
2019-09-13 22:52   ` Srinivas Pandruvada
2019-09-17 14:27     ` Giovanni Gherdovich
2019-09-17 15:55       ` Vincent Guittot
2019-09-19 23:55         ` Srinivas Pandruvada
2019-09-14 10:57   ` Quentin Perret
2019-09-17 14:27     ` Giovanni Gherdovich
2019-09-17 14:39       ` Quentin Perret
2019-09-24 14:03         ` Peter Zijlstra
2019-09-24 16:00           ` Peter Zijlstra
2019-10-02 12:27             ` Giovanni Gherdovich
2019-10-02 18:45               ` Peter Zijlstra
2019-09-24 16:04   ` Peter Zijlstra
2019-10-02 12:26     ` Giovanni Gherdovich
2019-10-02 18:35       ` Peter Zijlstra
2019-09-24 16:30   ` Peter Zijlstra
2019-10-02 12:25     ` Giovanni Gherdovich
2019-10-02 18:47       ` Peter Zijlstra
2019-09-09  2:42 ` [PATCH 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting Giovanni Gherdovich
2019-09-24 16:01 ` [PATCH 0/2] Add support for frequency invariance for (some) x86 Peter Zijlstra