From: "Rafael J. Wysocki"
To: Srinivas Pandruvada
Cc: lenb@kernel.org, mgorman@techsingularity.net, peterz@infradead.org,
    linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
    juri.lelli@redhat.com, viresh.kumar@linaro.org, ggherdovich@suse.cz
Subject: Re: [PATCH 0/4] Intel_pstate: HWP Dynamic performance boost
Date: Tue, 12 Jun 2018 17:04:07 +0200
Message-ID: <4917436.OUaLPInWbq@aspire.rjw.lan>
In-Reply-To: <20180605214242.62156-1-srinivas.pandruvada@linux.intel.com>
References: <20180605214242.62156-1-srinivas.pandruvada@linux.intel.com>

On Tuesday, June 5, 2018 11:42:38 PM CEST Srinivas Pandruvada wrote:
> v1 (Compared to RFC/RFT v3)
> - Minor coding suggestions for intel_pstate
> - Add SKL desktop model used in some Xeons
>
> Tested-by: Giovanni Gherdovich
>
> This series has an overall positive performance impact on IO both on xfs and
> ext4, and I'd be very happy if it lands in v4.18. You dropped the migration
> optimization from v1 to v2 after the reviewers' suggestion; I'm looking
> forward to testing that part too, so please add me to CC when you resend it.
>
> I've tested your series on a single socket Xeon E3-1240 v5 (Skylake, 4 cores /
> 8 threads) with SSD storage. The platform is a Dell PowerEdge R230.
>
> The benchmarks used are a mix of I/O intensive workloads on ext4 and xfs
> (dbench4, sqlite, pgbench in read/write and read-only configuration, Flexible
> IO aka FIO, etc) and scheduler stressers just to check that everything is okay
> in that department too (hackbench, pipetest, schbench, sockperf on localhost
> both in "throughput" and "under-load" mode, netperf on localhost, etc). There
> is also some HPC with the NAS Parallel Benchmark, as when using openMPI as IPC
> mechanism it ends up being write-intensive and that could be a good
> experiment, even if the HPC people aren't exactly the target audience for a
> frequency governor.
>
> The large improvements are in areas you already highlighted in your cover
> letter (dbench4, sqlite, and pgbench read/write too, very impressive
> honestly). Minor wins are also observed in sockperf and running the git unit
> tests (gitsource below). The scheduler stressers end up, as expected, in the
> "neutral" category where you'll also find FIO (which, given other results, I'd
> have expected to improve a little at least). Marked "neutral" are also those
> results where statistical significance wasn't reached (2 standard deviations,
> which is roughly like a 0.05 p-value) even if they showed some difference in
> one direction or the other. In the "small losses" section I found hackbench
> run with processes (not threads) and pipes (not sockets), which I report for
> due diligence, but looking at the raw numbers it's more of a mixed bag than a
> real loss, and the NAS high-perf computing benchmark when it uses openMP (as
> opposed to openMPI) for IPC -- but again, we often find that supercomputer
> people run the machines at full speed all the time.
>
> At the bottom of this message you'll find some directions if you want to run
> some tests yourself using the same framework I used, MMTests from
> https://github.com/gormanm/mmtests (we store a fair amount of benchmark
> parametrization up there).
>
> Large wins:
>
> - dbench4:              +20% on ext4,
>                         +14% on xfs (always asynch IO)
> - sqlite (insert):      +9% on both ext4 and xfs
> - pgbench (read/write): +9% on ext4,
>                         +10% on xfs
>
> Moderate wins:
>
> - sockperf (type: under-load, localhost):       +1% with TCP,
>                                                 +5% with UDP
> - gitsource (git unit tests, shell intensive):  +3% on ext4
> - NAS Parallel Benchmark (HPC, using openMPI, on xfs): +1%
> - tbench4 (network part of dbench4, localhost): +1%
>
> Neutral:
>
> - pgbench (read-only) on ext4 and xfs
> - siege
> - netperf (streaming and round-robin) with TCP and UDP
> - hackbench (sockets/process, sockets/thread and pipes/thread)
> - pipetest
> - Linux kernel build
> - schbench
> - sockperf (type: throughput) with TCP and UDP
> - git unit tests on xfs
> - FIO (both random and seq. read, both random and seq. write)
>   on ext4 and xfs, async IO
>
> Moderate losses:
>
> - hackbench (pipes/process): -10%
> - NAS Parallel Benchmark with openMP: -1%
>
>
> Each benchmark is run with a variety of configuration parameters (eg: number
> of threads, number of clients, etc); to reach a final "score" the geometric
> mean is used (with a few exceptions depending on the type of benchmark).
> Detailed results follow. Amean, Hmean and Gmean are respectively the
> arithmetic, harmonic and geometric means.
>
> For brevity I won't report all tables, only those for "large wins" and
> "moderate losses". Note that I'm not overly worried about the hackbench-pipes
> situation, as we've studied it in the past and determined that such a
> configuration is particularly weak: time is mostly spent on contention and
> the scheduler code path isn't exercised. See the comment in the file
> configs/config-global-dhp__scheduler-unbound in MMTests for a brief
> description of the issue.
>
> DBENCH4
> =======
>
> NOTES: asynchronous IO; varies the number of clients up to NUMCPUS*8.
> MMTESTS CONFIG: global-dhp__io-dbench4-async-{ext4, xfs}
> MEASURES: latency (millisecs)
> LOWER is better
>
> EXT4
>                            4.16.0                 4.16.0
>                           vanilla              hwp-boost
> Amean      1        28.49 (   0.00%)       19.68 (  30.92%)
> Amean      2        26.70 (   0.00%)       25.59 (   4.14%)
> Amean      4        54.59 (   0.00%)       43.56 (  20.20%)
> Amean      8        91.19 (   0.00%)       77.56 (  14.96%)
> Amean      64      538.09 (   0.00%)      438.67 (  18.48%)
> Stddev     1         6.70 (   0.00%)        3.24 (  51.66%)
> Stddev     2         4.35 (   0.00%)        3.57 (  17.85%)
> Stddev     4         7.99 (   0.00%)        7.24 (   9.29%)
> Stddev     8        17.51 (   0.00%)       15.80 (   9.78%)
> Stddev     64       49.54 (   0.00%)       46.98 (   5.17%)
>
> XFS
>                            4.16.0                 4.16.0
>                           vanilla              hwp-boost
> Amean      1        21.88 (   0.00%)       16.03 (  26.75%)
> Amean      2        19.72 (   0.00%)       19.82 (  -0.50%)
> Amean      4        37.55 (   0.00%)       29.52 (  21.38%)
> Amean      8        56.73 (   0.00%)       51.83 (   8.63%)
> Amean      64      808.80 (   0.00%)      698.12 (  13.68%)
> Stddev     1         6.29 (   0.00%)        2.33 (  62.99%)
> Stddev     2         3.12 (   0.00%)        2.26 (  27.73%)
> Stddev     4         7.56 (   0.00%)        5.88 (  22.28%)
> Stddev     8        14.15 (   0.00%)       12.49 (  11.71%)
> Stddev     64      380.54 (   0.00%)      367.88 (   3.33%)
>
> SQLITE
> ======
>
> NOTES: SQL insert test on a table that will be 2M in size.
> MMTESTS CONFIG: global-dhp__db-sqlite-insert-medium-{ext4, xfs}
> MEASURES: transactions per second
> HIGHER is better
>
> EXT4
>                            4.16.0                 4.16.0
>                           vanilla              hwp-boost
> Hmean  Trans      2098.79 (   0.00%)     2292.16 (   9.21%)
> Stddev Trans        78.79 (   0.00%)       95.73 ( -21.50%)
>
> XFS
>                            4.16.0                 4.16.0
>                           vanilla              hwp-boost
> Hmean  Trans      1890.27 (   0.00%)     2058.62 (   8.91%)
> Stddev Trans        52.54 (   0.00%)       29.56 (  43.73%)
>
> PGBENCH-RW
> ==========
>
> NOTES: packaged with Postgres. Varies the number of threads up to NUMCPUS. The
>   workload is scaled so that the approximate size is 80% of the database
>   shared buffer, which itself is 20% of RAM. The page cache is not flushed
>   after the database is populated for the test and starts cache-hot.
> MMTESTS CONFIG: global-dhp__db-pgbench-timed-rw-small-{ext4, xfs}
> MEASURES: transactions per second
> HIGHER is better
>
> EXT4
>                            4.16.0                 4.16.0
>                           vanilla              hwp-boost
> Hmean      1      2692.19 (   0.00%)     2660.98 (  -1.16%)
> Hmean      4      5218.93 (   0.00%)     5610.10 (   7.50%)
> Hmean      7      7332.68 (   0.00%)     8378.24 (  14.26%)
> Hmean      8      7462.03 (   0.00%)     8713.36 (  16.77%)
> Stddev     1       231.85 (   0.00%)      257.49 ( -11.06%)
> Stddev     4       681.11 (   0.00%)      312.64 (  54.10%)
> Stddev     7      1072.07 (   0.00%)      730.29 (  31.88%)
> Stddev     8      1472.77 (   0.00%)     1057.34 (  28.21%)
>
> XFS
>                            4.16.0                 4.16.0
>                           vanilla              hwp-boost
> Hmean      1      2675.02 (   0.00%)     2661.69 (  -0.50%)
> Hmean      4      5049.45 (   0.00%)     5601.45 (  10.93%)
> Hmean      7      7302.18 (   0.00%)     8348.16 (  14.32%)
> Hmean      8      7596.83 (   0.00%)     8693.29 (  14.43%)
> Stddev     1       225.41 (   0.00%)      246.74 (  -9.46%)
> Stddev     4       761.33 (   0.00%)      334.77 (  56.03%)
> Stddev     7      1093.93 (   0.00%)      811.30 (  25.84%)
> Stddev     8      1465.06 (   0.00%)     1118.81 (  23.63%)
>
> HACKBENCH
> =========
>
> NOTES: Varies the number of groups between 1 and NUMCPUS*4
> MMTESTS CONFIG: global-dhp__scheduler-unbound
> MEASURES: time (seconds)
> LOWER is better
>
>                            4.16.0                 4.16.0
>                           vanilla              hwp-boost
> Amean      1        0.8350 (   0.00%)      1.1577 ( -38.64%)
> Amean      3        2.8367 (   0.00%)      3.7457 ( -32.04%)
> Amean      5        6.7503 (   0.00%)      5.7977 (  14.11%)
> Amean      7        7.8290 (   0.00%)      8.0343 (  -2.62%)
> Amean      12      11.0560 (   0.00%)     11.9673 (  -8.24%)
> Amean      18      15.2603 (   0.00%)     15.5247 (  -1.73%)
> Amean      24      17.0283 (   0.00%)     17.9047 (  -5.15%)
> Amean      30      19.9193 (   0.00%)     23.4670 ( -17.81%)
> Amean      32      21.4637 (   0.00%)     23.4097 (  -9.07%)
> Stddev     1        0.0636 (   0.00%)      0.0255 (  59.93%)
> Stddev     3        0.1188 (   0.00%)      0.0235 (  80.22%)
> Stddev     5        0.0755 (   0.00%)      0.1398 ( -85.13%)
> Stddev     7        0.2778 (   0.00%)      0.1634 (  41.17%)
> Stddev     12       0.5785 (   0.00%)      0.1030 (  82.19%)
> Stddev     18       1.2099 (   0.00%)      0.7986 (  33.99%)
> Stddev     24       0.2057 (   0.00%)      0.7030 (-241.72%)
> Stddev     30       1.1303 (   0.00%)      0.7654 (  32.28%)
> Stddev     32       0.2032 (   0.00%)      3.1626 (-1456.69%)
>
> NAS PARALLEL BENCHMARK, C-CLASS (w/ openMP)
> ===========================================
>
> NOTES: The various computational kernels are run separately; see
>   https://www.nas.nasa.gov/publications/npb.html for the list of tasks (IS =
>   Integer Sort, EP = Embarrassingly Parallel, etc)
> MMTESTS CONFIG: global-dhp__nas-c-class-omp-full
> MEASURES: time (seconds)
> LOWER is better
>
>                            4.16.0                 4.16.0
>                           vanilla              hwp-boost
> Amean   bt.C       169.82 (   0.00%)      170.54 (  -0.42%)
> Stddev  bt.C         1.07 (   0.00%)        0.97 (   9.34%)
> Amean   cg.C        41.81 (   0.00%)       42.08 (  -0.65%)
> Stddev  cg.C         0.06 (   0.00%)        0.03 (  48.24%)
> Amean   ep.C        26.63 (   0.00%)       26.47 (   0.61%)
> Stddev  ep.C         0.37 (   0.00%)        0.24 (  35.35%)
> Amean   ft.C        38.17 (   0.00%)       38.41 (  -0.64%)
> Stddev  ft.C         0.33 (   0.00%)        0.32 (   3.78%)
> Amean   is.C         1.49 (   0.00%)        1.40 (   6.02%)
> Stddev  is.C         0.20 (   0.00%)        0.16 (  19.40%)
> Amean   lu.C       217.46 (   0.00%)      220.21 (  -1.26%)
> Stddev  lu.C         0.23 (   0.00%)        0.22 (   0.74%)
> Amean   mg.C        18.56 (   0.00%)       18.80 (  -1.31%)
> Stddev  mg.C         0.01 (   0.00%)        0.01 (  22.54%)
> Amean   sp.C       293.25 (   0.00%)      296.73 (  -1.19%)
> Stddev  sp.C         0.10 (   0.00%)        0.06 (  42.67%)
> Amean   ua.C       170.74 (   0.00%)      172.02 (  -0.75%)
> Stddev  ua.C         0.28 (   0.00%)        0.31 ( -12.89%)
>
> HOW TO REPRODUCE
> ================
>
> To install MMTests, clone the git repo at
> https://github.com/gormanm/mmtests.git
>
> To run a config (i.e. a set of benchmarks, such as
> config-global-dhp__nas-c-class-omp-full), use the command
>     ./run-mmtests.sh --config configs/$CONFIG $MNEMONIC-NAME
> from the top-level directory; the benchmark source will be downloaded from its
> canonical internet location, compiled and run.
>
> To compare results from two runs, use
>     ./bin/compare-mmtests.pl --directory ./work/log \
>         --benchmark $BENCHMARK-NAME \
>         --names $MNEMONIC-NAME-1,$MNEMONIC-NAME-2
> from the top-level directory.
>
> ==================
> From RFC Series:
> v3
> - Removed atomic bit operation as suggested.
> - Added description of contention with user space.
> - Removed the hwp cache and boost utility function patch and merged it with
>   the util callback patch. This way any value set is used somewhere.
>
> Waiting for test results from Mel Gorman, who is the original reporter.
>
> v2
> This is a much simpler version than the previous one and only considers IO
> boost, using the existing mechanism. There is no change in this series
> beyond the intel_pstate driver.
>
> Once PeterZ finishes his work on frequency invariance, I will revisit
> thread migration optimization in HWP mode.
>
> Other changes:
> - Gradual boost instead of single step as suggested by PeterZ.
> - Addressed cross-CPU synchronization concerns identified by Rafael.
> - Split the patch for HWP MSR value caching as suggested by PeterZ.
>
> Not changed as suggested:
> There is no architectural way to identify platforms with per-core
> P-states, so the feature still has to be enabled based on CPU model.
>
> -----------
> v1
>
> This series tries to address some performance concerns, particularly with IO
> workloads (reported by Mel Gorman), when HWP is used with the intel_pstate
> powersave policy.
>
> Background
> HWP performance can be controlled by user space using the sysfs interface for
> max/min frequency limits and energy performance preference (EPP) settings.
> Based on workload characteristics these can be adjusted from user space.
> These limits are not changed dynamically by the kernel based on workload.
>
> By default HWP uses an EPP value of 0x80 on the majority of platforms (the
> scale is 0-255, where 0 is max performance and 255 is min). This value offers
> the best performance/watt, and for the majority of server workloads
> performance doesn't suffer. Also, users always have the option to use the
> performance policy of intel_pstate to get the best performance. But users
> tend to run with the out-of-the-box configuration, which is the powersave
> policy on most distros.
>
> In some cases it is possible to dynamically adjust performance, for example
> when a CPU is woken up due to IO completion or a thread migrates to a new
> CPU. In this case the HWP algorithm will take some time to build up
> utilization and ramp up P-states. So this may result in lower performance for
> some IO workloads and workloads which tend to migrate. The idea of this patch
> series is to temporarily boost performance dynamically in these cases. This
> is applicable only when the user is using the powersave policy, not the
> performance policy.
>
> Results on a Skylake server:
>
> Benchmark                          Improvement %
> ----------------------------------------------------------------------
> dbench                                 50.36
> thread IO bench (tiobench)             10.35
> File IO                                 9.81
> sqlite                                 15.76
> X264 -104 cores                         9.75
>
> Spec Power    (Negligible impact 7382 Vs. 7378)
> Idle Power     No change observed
> -----------------------------------------------------------------------
>
> HWP brings in the best performance/watt at EPP=0x80. Since we are boosting
> EPP here to 0, the performance/watt drops by up to 10%. So there is a power
> penalty with these changes.
>
> Also, Mel Gorman provided test results on a prior patchset, which show the
> benefits of this series.
>
> Srinivas Pandruvada (4):
>   cpufreq: intel_pstate: Add HWP boost utility and sched util hooks
>   cpufreq: intel_pstate: HWP boost performance on IO wakeup
>   cpufreq: intel_pstate: New sysfs entry to control HWP boost
>   cpufreq: intel_pstate: enable boost for Skylake Xeon
>
>  drivers/cpufreq/intel_pstate.c | 179 ++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 176 insertions(+), 3 deletions(-)
>

Applied, thanks!
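
To make the boost mechanism described in the quoted cover letter easier to picture, here is a minimal, self-contained user-space sketch of the idea: on each IO-completion wakeup the energy-performance preference (EPP) is stepped gradually from its balanced default of 0x80 toward 0 (maximum performance), and it falls back to the default once the IO burst is over. This is only an illustration under stated assumptions, not the intel_pstate code from this series; the names io_wakeup(), boost_timeout() and EPP_BOOST_STEP, and the 0x20 step size, are invented for the sketch. Only the facts stated above are taken from the thread: EPP runs 0-255 with 0 as max performance, defaults to 0x80, and is boosted gradually toward 0 on IO wakeups when the powersave policy is in use.

/* Toy model of gradual HWP EPP boost on IO wakeup; not kernel code. */
#include <stdio.h>

#define EPP_PERFORMANCE 0x00    /* max-performance end of the EPP scale */
#define EPP_BALANCED    0x80    /* typical out-of-the-box EPP value     */
#define EPP_BOOST_STEP  0x20    /* arbitrary step size for this sketch  */

static int epp = EPP_BALANCED;  /* current (modelled) EPP value */

/* Called on an IO-completion wakeup: step EPP gradually toward 0. */
static void io_wakeup(void)
{
        if (epp > EPP_PERFORMANCE + EPP_BOOST_STEP)
                epp -= EPP_BOOST_STEP;
        else
                epp = EPP_PERFORMANCE;
}

/* Called when no IO wakeups are seen for a while: decay back to default. */
static void boost_timeout(void)
{
        epp = EPP_BALANCED;
}

int main(void)
{
        int i;

        /* A burst of IO wakeups ramps the preference up step by step... */
        for (i = 0; i < 6; i++) {
                io_wakeup();
                printf("after IO wakeup %d: EPP = 0x%02x\n", i + 1, epp);
        }

        /* ...and once the burst ends, it falls back to the balanced value. */
        boost_timeout();
        printf("after timeout:     EPP = 0x%02x\n", epp);
        return 0;
}

Running the sketch prints EPP stepping 0x60, 0x40, 0x20, 0x00 and then returning to 0x80, which mirrors the "gradual boost instead of single step" behaviour and the power trade-off (EPP pinned at 0 costs up to 10% performance/watt) discussed in the cover letter.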