Subject: Re: CPU excessively long times between frequency scaling driver calls - bisected
From: srinivas pandruvada
To: Doug Smythies, "Zhang, Rui"
Cc: "Tang, Feng", Thomas Gleixner, "paulmck@kernel.org",
 "stable@vger.kernel.org", "x86@kernel.org", "linux-pm@vger.kernel.org"
Date: Mon, 14 Feb 2022 07:17:24 -0800
References: <003f01d81c8c$d20ee3e0$762caba0$@telus.net>
 <20220208023940.GA5558@shbuild999.sh.intel.com>
 <20220208091525.GA7898@shbuild999.sh.intel.com>

Hi Doug,

I think you use CONFIG_NO_HZ_FULL. Here we are getting callbacks from
the scheduler. Can we check whether the scheduler woke up on those
CPUs? We can run "trace-cmd record -e sched" and check in KernelShark
whether there are similar gaps in activity.
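For example, something along these lines (a rough sketch; adjust the
sleep to span one of the long gaps you observed, and the CPU number to
the one of interest):

    sudo trace-cmd record -e sched sleep 300
    kernelshark trace.dat

Then filter on CPU 10 in KernelShark and look for stretches with no
sched_switch/sched_wakeup activity that line up with the long
durations reported by intel_pstate_tracer.py.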
Thanks,
Srinivas

On Sun, 2022-02-13 at 10:54 -0800, Doug Smythies wrote:
> Hi Rui,
>
> On Wed, Feb 9, 2022 at 11:45 PM Zhang, Rui wrote:
> > On 2022.02.09 14:23 Doug wrote:
> > > On Tue, Feb 8, 2022 at 1:15 AM Feng Tang wrote:
> > > > On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
> > > > > > > Since kernel 5.16-rc4 and commit:
> > > > > > > b50db7095fe002fa3e16605546cba66bf1b68a3e
> > > > > > > "x86/tsc: Disable clocksource watchdog for TSC on
> > > > > > > qualified platorms"
> > > > > > >
> > > > > > > There are now occasions where times between calls to the
> > > > > > > driver can be hundreds of seconds and can result in the
> > > > > > > CPU frequency being left unnecessarily high for extended
> > > > > > > periods.
> > > > > > >
> > > > > > > From the number of clock cycles executed between these
> > > > > > > long durations one can tell that the CPU has been running
> > > > > > > code, but the driver never got called.
> > > > > > >
> > > > > > > Attached are some graphs from some trace data acquired
> > > > > > > using intel_pstate_tracer.py where one can observe an
> > > > > > > idle system between about 42 and well over 200 seconds
> > > > > > > elapsed time, yet CPU10 never gets called, which would
> > > > > > > have resulted in reducing its pstate request, until an
> > > > > > > elapsed time of 167.616 seconds, 126 seconds since the
> > > > > > > last call. The CPU frequency never does go to minimum.
> > > > > > >
> > > > > > > For reference, a similar CPU frequency graph is also
> > > > > > > attached, with the commit reverted. The CPU frequency
> > > > > > > drops to minimum over about 10 or 15 seconds.
> > > > > >
> > > > > > Commit b50db7095fe0 essentially disables the clocksource
> > > > > > watchdog, which literally doesn't have much to do with
> > > > > > cpufreq code.
> > > > > >
> > > > > > One thing I can think of is, without the patch, there is a
> > > > > > periodic clocksource timer running every 500 ms, and it
> > > > > > loops to run on all CPUs in turn. For your HW, it has 12
> > > > > > CPUs (from the graph), so each CPU will get a timer (HW
> > > > > > timer interrupt backed) every 6 seconds. Could this affect
> > > > > > the cpufreq governor's work flow? (I just quickly read
> > > > > > some cpufreq code, and it seems there is
> > > > > > irq_work/workqueue involved.)
> > > > >
> > > > > 6 seconds is the longest duration I have ever seen on this
> > > > > processor before commit b50db7095fe0.
> > > > >
> > > > > I said "the times between calls to the driver have never
> > > > > exceeded 10 seconds" originally, but that involved other
> > > > > processors.
> > > > >
> > > > > I also did longer, 9000 second tests:
> > > > >
> > > > > For a reverted kernel the driver was called 131,743 times,
> > > > > and 0 times the duration was longer than 6.1 seconds.
> > > > >
> > > > > For a non-reverted kernel the driver was called 110,241
> > > > > times, 1397 times the duration was longer than 6.1 seconds,
> > > > > and the maximum duration was 303.6 seconds.
> > > >
> > > > Thanks for the data, which shows it is related to the removal
> > > > of the clocksource watchdog timers.
> > > > And under this specific configuration, the cpufreq work flow
> > > > has some dependence on those watchdog timers.
> > > >
> > > > Also, could you share your kernel config, boot message and
> > > > some system settings like for tickless mode, so that other
> > > > people can try to reproduce? Thanks.
> > >
> > > I steal the kernel configuration file from the Ubuntu mainline
> > > PPA [1], what they call "lowlatency", or 1000Hz tick. I make
> > > these changes before compiling:
> > >
> > > scripts/config --disable DEBUG_INFO
> > > scripts/config --disable SYSTEM_TRUSTED_KEYS
> > > scripts/config --disable SYSTEM_REVOCATION_KEYS
> > >
> > > I also sent you the config and dmesg files in an off-list email.
> > >
> > > This is an idle, and very low periodic loads, system type test.
> > > My test computer has no GUI and very few services running.
> > > Notice that I have not used the word "regression" yet in this
> > > thread, because I don't know for certain that it is. In the end,
> > > we don't care about CPU frequency, we care about wasting energy.
> > > It is definitely a change, and I am able to measure small
> > > increases in energy use, but this is all at the low end of the
> > > power curve.
>
> Please keep the above statement in mind. The differences were rather
> minor. Since Rui asks for data below, I have tried to find better
> examples.
>
> > What do you use to measure the energy use? And what difference do
> > you observe?
>
> I use both turbostat and a processor power monitoring tool of my
> own. Don't get me wrong, I love turbostat, but it has become
> somewhat heavy (lots of code per interval) over recent years. My
> own utility has minimal code per interval, and only uses
> pre-allocated memory with no disk or terminal interaction during
> the sampling phase, resulting in minimal system perturbation due to
> itself.
>
> > > So far I have not found a significant example of increased
> > > power use, but I also have not looked very hard.
>
> I looked a little harder for this reply, searching all single-CPU
> loads for a few work/sleep frequencies (73, 113, 211, 347, 401
> hertz) with a fixed work packet per work interval.
>
> > > During any test, many monitoring tools might shorten durations.
> > > For example if I run turbostat, say:
> > >
> > > sudo turbostat --Summary --quiet --show
> > > Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt
> > > --interval 2.5
> > >
> > > Well, yes, then the maximum duration would be 2.5 seconds,
> > > because turbostat wakes up each CPU to inquire about things,
> > > causing a call to the CPU scaling driver. (I tested this, for
> > > about 900 seconds.)
> > >
> > > For my power tests I use a sample interval of >= 300 seconds.
> >
> > So you use something like "turbostat sleep 900" for the power
> > test,
>
> Typically I use 300 seconds for turbostat for this work.
> It was only the other day that I saw a duration of 302 seconds, so
> yes, I should try even longer, but it is becoming a tradeoff
> between experiments taking many, many hours and time available.
>
> > and the RAPL energy counters show the power difference?
>
> Yes.
>
> > Can you paste the turbostat output both w/ and w/o the watchdog?
>
> Workflow:
>
> Task 1: Purpose: main periodic load.
>
> 411 hertz work/sleep frequency and about 26.6% load at 4.8 GHz,
> max turbo (it would limit out at 100% duty cycle at about
> pstate 13).
>
> Note: this is a higher load than I was using when I started this
> email thread.
> Task 2: Purpose: To assist in promoting manifestation of the issue
> of potential concern with commit b50db7095fe0.
>
> A short burst of work (about 30 milliseconds @ max turbo, longer at
> lower frequencies), then sleep for 45 seconds. Say, almost 0 load
> at 0.022 hertz.
> Greater than or equal to 30 milliseconds at full load ensures that
> the intel_pstate driver will be called at least 2 or 3 times,
> raising the requested pstate for that CPU.
>
> Task 3: Purpose: Acquire power data with minimum system
> perturbation due to this very monitoring task.
>
> Both turbostat and my own power monitor were used, never at the
> same time (unless just to prove they reported the same power). This
> turbostat command:
>
> turbostat --Summary --quiet --show
> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt
> --interval 300
>
> and my program also sampled at 300 seconds per sample, typically 5
> samples, and after 1/2 hour since boot.
>
> Kernels:
> stock: 5.17-rc3 as is.
> revert: 5.17-rc3 with b50db7095fe0 reverted.
>
> Test 1: no-hwp/active/powersave:
>
> Doug's cpu_power_mon:
>
> Stock: 5 samples @ 300 seconds per sample:
> average: 4.7 watts (+9%)
> minimum: 4.3 watts
> maximum: 4.9 watts
>
> Revert: 5 samples @ 300 seconds per sample:
> average: 4.3 watts
> minimum: 4.2 watts
> maximum: 4.5 watts
>
> Test 2: no-hwp/passive/schedutil:
>
> In the beginning active/powersave was used because it is the
> easiest (at least for me) to understand and interpret the
> intel_pstate_tracer.py results. Long durations are common in
> passive mode, because something higher up can decide not to call
> the driver because nothing changed. Very valuable information is
> lost, in my opinion.
>
> I didn't figure it out for a while, but schedutil is bi-stable with
> this workflow, depending on whether it approaches steady state from
> a higher or lower previous load (i.e. hysteresis). With either
> kernel it might run for hours and hours at an average of 6.1 watts
> or 4.9 watts. This difference dominates, so trying to reveal
> whether there is any contribution from the commit of this thread is
> useless.
>
> Test 3: Similar workflow to test 1, but 347 hertz and a little less
> work per work period.
>
> Task 2 was changed to use more CPUs to try to amplify manifestation
> of the effect.
>
> Doug's cpu_power_mon:
>
> Stock: 5 samples @ 300 seconds per sample:
> average: 4.2 watts (+31%)
> minimum: 3.5 watts
> maximum: 4.9 watts
>
> Revert: 5 samples @ 300 seconds per sample:
> average: 3.2 watts
> minimum: 3.1 watts
> maximum: 3.2 watts
>
> Re-do of test 1, but with the improved task 2:
>
> Turbostat:
>
> Stock:
>
> doug@s19:~$ sudo
> /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat
> --Summary --quiet --show
> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt
> --interval 300
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt CorWatt GFXWatt RAMWatt
> 4.09    3335    282024  36      5.79    5.11    0.00    0.89
> 4.11    3449    283286  34      6.06    5.40    0.00    0.89
> 4.11    3504    284532  35      6.35    5.70    0.00    0.89
> 4.11    3493    283908  35      6.26    5.60    0.00    0.89
> 4.11    3499    284243  34      6.27    5.62    0.00    0.89
>
> Revert:
>
> doug@s19:~/freq-scalers/long_dur$ sudo
> /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat
> --Summary --quiet --show
> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt
> --interval 300
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt CorWatt GFXWatt RAMWatt
> 4.12    3018    283114  34      4.57    3.91    0.00    0.89
> 4.12    3023    283482  34      4.59    3.94    0.00    0.89
> 4.12    3017    282609  34      4.71    4.05    0.00    0.89
> 4.12    3013    282898  34      4.64    3.99    0.00    0.89
> 4.12    3013    282956  36      4.56    3.90    0.00    0.89
> 4.12    3026    282807  34      4.61    3.95    0.00    0.89
> 4.12    3026    282738  35      4.50    3.84    0.00    0.89
>
> Or an average of 6.2 watts versus 4.6 watts, +35%.
>
> Several other tests had undetectable power differences between
> kernels, but I did not go back and re-test with the improved
> task 2.
>
> > Thanks,
> > rui
>
> > > For duration-only tests, turbostat is not run at the same time.
> > >
> > > My grub line:
> > >
> > > GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314
> > > intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on
> > > cpuidle.governor=teo"
> > >
> > > A typical pstate tracer command (with the script copied to the
> > > directory where I run this stuff):
> > >
> > > sudo ./intel_pstate_tracer.py --interval 600 --name vnew02
> > > --memory 800000
> > >
> > > > > > Can you try one test that keeps all the current settings
> > > > > > and changes the irq affinity of the disk/network card to
> > > > > > 0xfff, to let interrupts from them be distributed to all
> > > > > > CPUs?
> > > > >
> > > > > I am willing to do the test, but I do not know how to
> > > > > change the irq affinity.
> > > >
> > > > I may have said that too soon. I used to
> > > > "echo fff > /proc/irq/xxx/smp_affinity" (xxx is the irq
> > > > number of a device) to let interrupts be distributed to all
> > > > CPUs a long time ago, but it doesn't work on my 2 desktops at
> > > > hand. It seems recent kernels only support one-CPU irq
> > > > affinity.
> > > >
> > > > You can still try that command, though it may not work.
> > >
> > > I did not try this yet.
> > >
> > > [1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/
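For the irq affinity experiment, a loop like this (an untested
sketch; as Feng said, on recent kernels many interrupts will only
accept a single-CPU mask and will reject the write) would try it for
every irq at once:

    for irq in /proc/irq/[0-9]*; do
        # tee is used because a plain "sudo echo fff > ..." redirect
        # would be performed by the non-root shell and fail
        echo fff | sudo tee "$irq/smp_affinity" > /dev/null
    done

Also, in case it helps others reproduce, a burst load roughly like
your task 2 can be approximated in shell (a sketch, not your actual
tool; fractional "timeout" durations need GNU coreutils):

    while true; do
        # ~30 ms of spinning at full load...
        timeout 0.03 sh -c 'while :; do :; done'
        # ...then sleep 45 seconds, i.e. roughly 0.022 hertz
        sleep 45
    done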