* CPU excessively long times between frequency scaling driver calls - bisected
@ 2022-02-08  1:40 Doug Smythies
  2022-02-08  2:39 ` Feng Tang
  0 siblings, 1 reply; 47+ messages in thread
From: Doug Smythies @ 2022-02-08  1:40 UTC (permalink / raw)
  To: 'Feng Tang', 'Thomas Gleixner'
  Cc: paulmck, stable, x86, linux-pm, 'srinivas pandruvada',
	Doug Smythies


Hi All,

Note before: I do not know if I have the e-mail address list correct,
nor am I actually a member of the x86 mailing list. I am on the
linux-pm mailing list.

When using the intel_pstate CPU frequency scaling driver with HWP disabled,
in active mode with the powersave scaling governor, the times between calls
to the driver have never exceeded 10 seconds.

Since kernel 5.16-rc4 and commit b50db7095fe002fa3e16605546cba66bf1b68a3e
("x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"),
there are now occasions where the time between calls to the driver can be
hundreds of seconds, which can result in the CPU frequency being left
unnecessarily high for extended periods.

From the number of clock cycles executed during these long durations,
one can tell that the CPU has been running code, but the driver never
got called.

Attached are some graphs from trace data acquired using
intel_pstate_tracer.py, where one can observe an idle system between
about 42 and well over 200 seconds of elapsed time, yet the driver is
never called for CPU10, which would have reduced its pstate request,
until an elapsed time of 167.616 seconds, 126 seconds since the last
call. The CPU frequency never does go to minimum.

For reference, a similar CPU frequency graph is also attached, with
the commit reverted. There the CPU frequency drops to minimum over
about 10 to 15 seconds.

Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz

Why this particular configuration, i.e. no-hwp, active, powersave?
Because it is, by far, the easiest one in which to observe what is
going on.

... Doug


[-- Attachment #2: winmail.dat --]
[-- Type: application/ms-tnef, Size: 110376 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-08  1:40 CPU excessively long times between frequency scaling driver calls - bisected Doug Smythies
@ 2022-02-08  2:39 ` Feng Tang
  2022-02-08  7:13   ` Doug Smythies
  0 siblings, 1 reply; 47+ messages in thread
From: Feng Tang @ 2022-02-08  2:39 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Thomas Gleixner',
	paulmck, stable, x86, linux-pm, 'srinivas pandruvada'

Hi Doug,

Thanks for the report.

On Tue, Feb 08, 2022 at 09:40:14AM +0800, Doug Smythies wrote:
> Hi All,
> 
> Note before: I do not know If I have the e-mail address list correct,
> nor am I actually a member of the x86 distribution list. I am on
> the linux-pm email list.
> 
> When using the intel_pstate CPU frequency scaling driver with HWP disabled,
> active mode, powersave scaling governor, the times between calls to the driver
> have never exceeded 10 seconds.
> 
> Since kernel 5.16-rc4 and commit: b50db7095fe002fa3e16605546cba66bf1b68a3e
> " x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"
> 
> There are now occasions where times between calls to the driver can be
> over 100's of seconds and can result in the CPU frequency being left
> unnecessarily high for extended periods.
> 
> From the number of clock cycles executed between these long
> durations one can tell that the CPU has been running code, but
> the driver never got called.
> 
> Attached are some graphs from some trace data acquired using
> intel_pstate_tracer.py where one can observe an idle system between
> about 42 and well over 200 seconds elapsed time, yet CPU10 never gets
> called, which would have resulted in reducing it's pstate request, until
> an elapsed time of 167.616 seconds, 126 seconds since the last call. The
> CPU frequency never does go to minimum.
> 
> For reference, a similar CPU frequency graph is also attached, with
> the commit reverted. The CPU frequency drops to minimum,
> over about 10 or 15 seconds.


Commit b50db7095fe0 essentially disables the clocksource watchdog,
which doesn't directly have much to do with the cpufreq code.

One thing I can think of is that, without the patch, there is a
periodic clocksource watchdog timer running every 500 ms, and it loops
over all CPUs in turn. Your HW has 12 CPUs (from the graph), so each
CPU gets a timer (backed by a HW timer interrupt) every 6 seconds.
Could this affect the cpufreq governor's work flow? (I just quickly
read some cpufreq code, and it seems there is irq_work/workqueue
involved.)

Can you try one test that keeps all the current settings and changes
the irq affinity of the disk/network card to 0xfff, to let interrupts
from them be distributed to all CPUs?

Thanks,
Feng


> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> 
> Why this particular configuration, i.e. no-hwp, active, powersave?
> Because it is, by far, the easiest to observe what is going on.
> 
> ... Doug
> 






^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-08  2:39 ` Feng Tang
@ 2022-02-08  7:13   ` Doug Smythies
  2022-02-08  9:15     ` Feng Tang
  0 siblings, 1 reply; 47+ messages in thread
From: Doug Smythies @ 2022-02-08  7:13 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, paulmck, stable, x86, linux-pm,
	srinivas pandruvada, dsmythies

Hi Feng,

Thank you for your reply.

On Mon, Feb 7, 2022 at 6:39 PM Feng Tang <feng.tang@intel.com> wrote:
>
> Hi Doug,
>
> Thanks for the report.
>
> On Tue, Feb 08, 2022 at 09:40:14AM +0800, Doug Smythies wrote:
> > Hi All,
> >
> > Note before: I do not know If I have the e-mail address list correct,
> > nor am I actually a member of the x86 distribution list. I am on
> > the linux-pm email list.
> >
> > When using the intel_pstate CPU frequency scaling driver with HWP disabled,
> > active mode, powersave scaling governor, the times between calls to the driver
> > have never exceeded 10 seconds.
> >
> > Since kernel 5.16-rc4 and commit: b50db7095fe002fa3e16605546cba66bf1b68a3e
> > " x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"
> >
> > There are now occasions where times between calls to the driver can be
> > over 100's of seconds and can result in the CPU frequency being left
> > unnecessarily high for extended periods.
> >
> > From the number of clock cycles executed between these long
> > durations one can tell that the CPU has been running code, but
> > the driver never got called.
> >
> > Attached are some graphs from some trace data acquired using
> > intel_pstate_tracer.py where one can observe an idle system between
> > about 42 and well over 200 seconds elapsed time, yet CPU10 never gets
> > called, which would have resulted in reducing it's pstate request, until
> > an elapsed time of 167.616 seconds, 126 seconds since the last call. The
> > CPU frequency never does go to minimum.
> >
> > For reference, a similar CPU frequency graph is also attached, with
> > the commit reverted. The CPU frequency drops to minimum,
> > over about 10 or 15 seconds.,
>
> commit b50db7095fe0 essentially disables the clocksource watchdog,
> which literally doesn't have much to do with cpufreq code.
>
> One thing I can think of is, without the patch, there is a periodic
> clocksource timer running every 500 ms, and it loops to run on
> all CPUs in turn. For your HW, it has 12 CPUs (from the graph),
> so each CPU will get a timer (HW timer interrupt backed) every 6
> seconds. Could this affect the cpufreq governor's work flow (I just
> quickly read some cpufreq code, and seem there is irq_work/workqueue
> involved).

6 seconds is the longest duration I had ever seen on this
processor before commit b50db7095fe0.

I said "the times between calls to the driver have never
exceeded 10 seconds" originally, but that statement involved other
processors.

I also did longer, 9000 second tests:

For a reverted kernel the driver was called 131,743 times, and the
duration was never longer than 6.1 seconds.

For a non-reverted kernel the driver was called 110,241 times; 1,397
times the duration was longer than 6.1 seconds, and the maximum
duration was 303.6 seconds.

> Can you try one test that keep all the current setting and change
> the irq affinity of disk/network-card to 0xfff to let interrupts
> from them be distributed to all CPUs?

I am willing to do the test, but I do not know how to change the
irq affinity.

> Thanks,
> Feng
>
>
> > Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> >
> > Why this particular configuration, i.e. no-hwp, active, powersave?
> > Because it is, by far, the easiest to observe what is going on.
> >
> > ... Doug
> >
>
>
>
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-08  7:13   ` Doug Smythies
@ 2022-02-08  9:15     ` Feng Tang
  2022-02-09  6:23       ` Doug Smythies
  0 siblings, 1 reply; 47+ messages in thread
From: Feng Tang @ 2022-02-08  9:15 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Thomas Gleixner, paulmck, stable, x86, linux-pm, srinivas pandruvada

On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
> > >
> > > Since kernel 5.16-rc4 and commit: b50db7095fe002fa3e16605546cba66bf1b68a3e
> > > " x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"
> > >
> > > There are now occasions where times between calls to the driver can be
> > > over 100's of seconds and can result in the CPU frequency being left
> > > unnecessarily high for extended periods.
> > >
> > > From the number of clock cycles executed between these long
> > > durations one can tell that the CPU has been running code, but
> > > the driver never got called.
> > >
> > > Attached are some graphs from some trace data acquired using
> > > intel_pstate_tracer.py where one can observe an idle system between
> > > about 42 and well over 200 seconds elapsed time, yet CPU10 never gets
> > > called, which would have resulted in reducing it's pstate request, until
> > > an elapsed time of 167.616 seconds, 126 seconds since the last call. The
> > > CPU frequency never does go to minimum.
> > >
> > > For reference, a similar CPU frequency graph is also attached, with
> > > the commit reverted. The CPU frequency drops to minimum,
> > > over about 10 or 15 seconds.,
> >
> > commit b50db7095fe0 essentially disables the clocksource watchdog,
> > which literally doesn't have much to do with cpufreq code.
> >
> > One thing I can think of is, without the patch, there is a periodic
> > clocksource timer running every 500 ms, and it loops to run on
> > all CPUs in turn. For your HW, it has 12 CPUs (from the graph),
> > so each CPU will get a timer (HW timer interrupt backed) every 6
> > seconds. Could this affect the cpufreq governor's work flow (I just
> > quickly read some cpufreq code, and seem there is irq_work/workqueue
> > involved).
> 
> 6 Seconds is the longest duration I have ever seen on this
> processor before commit b50db7095fe0.
> 
> I said "the times between calls to the driver have never
> exceeded 10 seconds" originally, but that involved other processors.
> 
> I also did longer, 9000 second tests:
> 
> For a reverted kernel the driver was called 131,743,
> and 0 times the duration was longer than 6.1 seconds.
> 
> For a non-reverted kernel the driver was called 110,241 times,
> and 1397 times the duration was longer than 6.1 seconds,
> and the maximum duration was 303.6 seconds
 
Thanks for the data, which shows it is related to the removal of the
clocksource watchdog timers. Under this specific configuration, the
cpufreq work flow has some dependence on those watchdog timers.

Also, could you share your kernel config, boot messages and some
system settings, like the tickless mode, so that other people can try
to reproduce this? Thanks.

> > Can you try one test that keep all the current setting and change
> > the irq affinity of disk/network-card to 0xfff to let interrupts
> > from them be distributed to all CPUs?
> 
> I am willing to do the test, but I do not know how to change the
> irq affinity.

I may have said that too soon. I used to "echo fff > /proc/irq/xxx/smp_affinity"
(xxx is the irq number of a device) to let interrupts be distributed
to all CPUs a long time ago, but it doesn't work on the 2 desktops I
have at hand. It seems only one-CPU irq affinity is supported in
recent kernels.

You can still try that command, though it may not work.
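
For example, something like this (the grep pattern and the irq number
128 below are only placeholders for your devices):

grep -iE 'eth|enp|nvme|ahci' /proc/interrupts    # find the irq numbers
cat /proc/irq/128/smp_affinity                   # current mask
echo fff | sudo tee /proc/irq/128/smp_affinity   # the write may be rejected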

Thanks,
Feng



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-08  9:15     ` Feng Tang
@ 2022-02-09  6:23       ` Doug Smythies
  2022-02-10  7:45         ` Zhang, Rui
  0 siblings, 1 reply; 47+ messages in thread
From: Doug Smythies @ 2022-02-09  6:23 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, paulmck, stable, x86, linux-pm,
	srinivas pandruvada, dsmythies

On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@intel.com> wrote:
> On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
> > > >
> > > > Since kernel 5.16-rc4 and commit: b50db7095fe002fa3e16605546cba66bf1b68a3e
> > > > " x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"
> > > >
> > > > There are now occasions where times between calls to the driver can be
> > > > over 100's of seconds and can result in the CPU frequency being left
> > > > unnecessarily high for extended periods.
> > > >
> > > > From the number of clock cycles executed between these long
> > > > durations one can tell that the CPU has been running code, but
> > > > the driver never got called.
> > > >
> > > > Attached are some graphs from some trace data acquired using
> > > > intel_pstate_tracer.py where one can observe an idle system between
> > > > about 42 and well over 200 seconds elapsed time, yet CPU10 never gets
> > > > called, which would have resulted in reducing it's pstate request, until
> > > > an elapsed time of 167.616 seconds, 126 seconds since the last call. The
> > > > CPU frequency never does go to minimum.
> > > >
> > > > For reference, a similar CPU frequency graph is also attached, with
> > > > the commit reverted. The CPU frequency drops to minimum,
> > > > over about 10 or 15 seconds.,
> > >
> > > commit b50db7095fe0 essentially disables the clocksource watchdog,
> > > which literally doesn't have much to do with cpufreq code.
> > >
> > > One thing I can think of is, without the patch, there is a periodic
> > > clocksource timer running every 500 ms, and it loops to run on
> > > all CPUs in turn. For your HW, it has 12 CPUs (from the graph),
> > > so each CPU will get a timer (HW timer interrupt backed) every 6
> > > seconds. Could this affect the cpufreq governor's work flow (I just
> > > quickly read some cpufreq code, and seem there is irq_work/workqueue
> > > involved).
> >
> > 6 Seconds is the longest duration I have ever seen on this
> > processor before commit b50db7095fe0.
> >
> > I said "the times between calls to the driver have never
> > exceeded 10 seconds" originally, but that involved other processors.
> >
> > I also did longer, 9000 second tests:
> >
> > For a reverted kernel the driver was called 131,743,
> > and 0 times the duration was longer than 6.1 seconds.
> >
> > For a non-reverted kernel the driver was called 110,241 times,
> > and 1397 times the duration was longer than 6.1 seconds,
> > and the maximum duration was 303.6 seconds
>
> Thanks for the data, which shows it is related to the removal of
> clocksource watchdog timers. And under this specific configurations,
> the cpufreq work flow has some dependence on that watchdog timers.
>
> Also could you share you kernel config, boot message and some
> system settings like for tickless mode, so that other people can
> try to reproduce? thanks

I take the kernel configuration file from the Ubuntu mainline PPA
[1], the one they call "lowlatency", i.e. a 1000 Hz tick. I make these
changes before compiling:

scripts/config --disable DEBUG_INFO
scripts/config --disable SYSTEM_TRUSTED_KEYS
scripts/config --disable SYSTEM_REVOCATION_KEYS

I also send you the config and dmesg files in an off-list email.

This is a test of an idle system with very low periodic loads.
My test computer has no GUI and very few services running.
Notice that I have not used the word "regression" yet in this thread,
because I don't know for certain that it is one. In the end, we don't
care about CPU frequency, we care about wasting energy.
It is definitely a change, and I am able to measure small increases
in energy use, but this is all at the low end of the power curve.
So far I have not found a significant example of increased power
use, but I also have not looked very hard.

During any test, many monitoring tools might shorten durations.
For example if I run turbostat, say:

sudo turbostat --Summary --quiet --show
Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval
2.5

Well, yes, then the maximum duration would be 2.5 seconds,
because turbostat wakes up each CPU to inquire about things,
causing a call to the CPU scaling driver. (I tested this, for about
900 seconds.)

For my power tests I use a sample interval of >= 300 seconds.
For duration only tests, turbostat is not run at the same time.
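
Another way to keep the monitoring overhead down is turbostat's
one-shot form, which reads the counters only before and after a forked
command; just as a sketch of the idea:

sudo turbostat --Summary --quiet \
    --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt \
    sleep 300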

My grub line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314
intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on
cpuidle.governor=teo"

A typical pstate tracer command (with the script copied to the
directory where I run this stuff):

sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --memory 800000

>
> > > Can you try one test that keep all the current setting and change
> > > the irq affinity of disk/network-card to 0xfff to let interrupts
> > > from them be distributed to all CPUs?
> >
> > I am willing to do the test, but I do not know how to change the
> > irq affinity.
>
> I might say that too soon. I used to "echo fff > /proc/irq/xxx/smp_affinity"
> (xx is the irq number of a device) to let interrupts be distributed
> to all CPUs long time ago, but it doesn't work on my 2 desktops at hand.
> Seems it only support one-cpu irq affinity in recent kernel.
>
> You can still try that command, though it may not work.

I did not try this yet.

[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-09  6:23       ` Doug Smythies
@ 2022-02-10  7:45         ` Zhang, Rui
  2022-02-13 18:54           ` Doug Smythies
  0 siblings, 1 reply; 47+ messages in thread
From: Zhang, Rui @ 2022-02-10  7:45 UTC (permalink / raw)
  To: Doug Smythies, Tang, Feng
  Cc: Thomas Gleixner, paulmck, stable, x86, linux-pm, srinivas pandruvada



> -----Original Message-----
> From: Doug Smythies <dsmythies@telus.net>
> Sent: Wednesday, February 09, 2022 2:23 PM
> To: Tang, Feng <feng.tang@intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>; paulmck@kernel.org;
> stable@vger.kernel.org; x86@kernel.org; linux-pm@vger.kernel.org; srinivas
> pandruvada <srinivas.pandruvada@linux.intel.com>; dsmythies
> <dsmythies@telus.net>
> Subject: Re: CPU excessively long times between frequency scaling driver
> calls - bisected
> 
> On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@intel.com> wrote:
> > On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
> > > > >
> > > > > Since kernel 5.16-rc4 and commit:
> > > > > b50db7095fe002fa3e16605546cba66bf1b68a3e
> > > > > " x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"
> > > > >
> > > > > There are now occasions where times between calls to the driver
> > > > > can be over 100's of seconds and can result in the CPU frequency
> > > > > being left unnecessarily high for extended periods.
> > > > >
> > > > > From the number of clock cycles executed between these long
> > > > > durations one can tell that the CPU has been running code, but
> > > > > the driver never got called.
> > > > >
> > > > > Attached are some graphs from some trace data acquired using
> > > > > intel_pstate_tracer.py where one can observe an idle system
> > > > > between about 42 and well over 200 seconds elapsed time, yet
> > > > > CPU10 never gets called, which would have resulted in reducing
> > > > > it's pstate request, until an elapsed time of 167.616 seconds,
> > > > > 126 seconds since the last call. The CPU frequency never does go to
> minimum.
> > > > >
> > > > > For reference, a similar CPU frequency graph is also attached,
> > > > > with the commit reverted. The CPU frequency drops to minimum,
> > > > > over about 10 or 15 seconds.,
> > > >
> > > > commit b50db7095fe0 essentially disables the clocksource watchdog,
> > > > which literally doesn't have much to do with cpufreq code.
> > > >
> > > > One thing I can think of is, without the patch, there is a
> > > > periodic clocksource timer running every 500 ms, and it loops to
> > > > run on all CPUs in turn. For your HW, it has 12 CPUs (from the
> > > > graph), so each CPU will get a timer (HW timer interrupt backed)
> > > > every 6 seconds. Could this affect the cpufreq governor's work
> > > > flow (I just quickly read some cpufreq code, and seem there is
> > > > irq_work/workqueue involved).
> > >
> > > 6 Seconds is the longest duration I have ever seen on this processor
> > > before commit b50db7095fe0.
> > >
> > > I said "the times between calls to the driver have never exceeded 10
> > > seconds" originally, but that involved other processors.
> > >
> > > I also did longer, 9000 second tests:
> > >
> > > For a reverted kernel the driver was called 131,743, and 0 times the
> > > duration was longer than 6.1 seconds.
> > >
> > > For a non-reverted kernel the driver was called 110,241 times, and
> > > 1397 times the duration was longer than 6.1 seconds, and the maximum
> > > duration was 303.6 seconds
> >
> > Thanks for the data, which shows it is related to the removal of
> > clocksource watchdog timers. And under this specific configurations,
> > the cpufreq work flow has some dependence on that watchdog timers.
> >
> > Also could you share you kernel config, boot message and some system
> > settings like for tickless mode, so that other people can try to
> > reproduce? thanks
> 
> I steal the kernel configuration file from the Ubuntu mainline PPA [1], what
> they call "lowlatency", or 1000Hz tick. I make these changes before compile:
> 
> scripts/config --disable DEBUG_INFO
> scripts/config --disable SYSTEM_TRUSTED_KEYS scripts/config --disable
> SYSTEM_REVOCATION_KEYS
> 
> I also send you the config and dmesg files in an off-list email.
> 
> This is an idle, and very low periodic loads, system type test.
> My test computer has no GUI and very few services running.
> Notice that I have not used the word "regression" yet in this thread, because
> I don't know for certain that it is. In the end, we don't care about CPU
> frequency, we care about wasting energy.
> It is definitely a change, and I am able to measure small increases in energy
> use, but this is all at the low end of the power curve.

What do you use to measure the energy use? And what difference do you observe?

> So far I have not found a significant example of increased power use, but I
> also have not looked very hard.
> 
> During any test, many monitoring tools might shorten durations.
> For example if I run turbostat, say:
> 
> sudo turbostat --Summary --quiet --show
> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --
> interval
> 2.5
> 
> Well, yes then the maximum duration would be 2.5 seconds, because
> turbostat wakes up each CPU to inquire about things causing a call to the CPU
> scaling driver. (I tested this, for about
> 900 seconds.)
> 
> For my power tests I use a sample interval of >= 300 seconds.

So you use something like "turbostat sleep 900" for the power test,
and the RAPL energy counters show the power difference?
Can you paste the turbostat output both w/ and w/o the watchdog?

Thanks,
rui

> For duration only tests, turbostat is not run at the same time.
> 
> My grub line:
> 
> GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314
> intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on
> cpuidle.governor=teo"
> 
> A typical pstate tracer command (with the script copied to the directory
> where I run this stuff:):
> 
> sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --memory
> 800000
> 
> >
> > > > Can you try one test that keep all the current setting and change
> > > > the irq affinity of disk/network-card to 0xfff to let interrupts
> > > > from them be distributed to all CPUs?
> > >
> > > I am willing to do the test, but I do not know how to change the irq
> > > affinity.
> >
> > I might say that too soon. I used to "echo fff > /proc/irq/xxx/smp_affinity"
> > (xx is the irq number of a device) to let interrupts be distributed to
> > all CPUs long time ago, but it doesn't work on my 2 desktops at hand.
> > Seems it only support one-cpu irq affinity in recent kernel.
> >
> > You can still try that command, though it may not work.
> 
> I did not try this yet.
> 
> [1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-10  7:45         ` Zhang, Rui
@ 2022-02-13 18:54           ` Doug Smythies
  2022-02-14 15:17             ` srinivas pandruvada
  0 siblings, 1 reply; 47+ messages in thread
From: Doug Smythies @ 2022-02-13 18:54 UTC (permalink / raw)
  To: Zhang, Rui
  Cc: Tang, Feng, Thomas Gleixner, paulmck, stable, x86, linux-pm,
	srinivas pandruvada, dsmythies

Hi Rui,

On Wed, Feb 9, 2022 at 11:45 PM Zhang, Rui <rui.zhang@intel.com> wrote:
> On 2022.02.09 14:23 Doug wrote:
> > On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@intel.com> wrote:
> > > On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
> > > > > >
> > > > > > Since kernel 5.16-rc4 and commit:
> > > > > > b50db7095fe002fa3e16605546cba66bf1b68a3e
> > > > > > " x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"
> > > > > >
> > > > > > There are now occasions where times between calls to the driver
> > > > > > can be over 100's of seconds and can result in the CPU frequency
> > > > > > being left unnecessarily high for extended periods.
> > > > > >
> > > > > > From the number of clock cycles executed between these long
> > > > > > durations one can tell that the CPU has been running code, but
> > > > > > the driver never got called.
> > > > > >
> > > > > > Attached are some graphs from some trace data acquired using
> > > > > > intel_pstate_tracer.py where one can observe an idle system
> > > > > > between about 42 and well over 200 seconds elapsed time, yet
> > > > > > CPU10 never gets called, which would have resulted in reducing
> > > > > > it's pstate request, until an elapsed time of 167.616 seconds,
> > > > > > 126 seconds since the last call. The CPU frequency never does go to
> > minimum.
> > > > > >
> > > > > > For reference, a similar CPU frequency graph is also attached,
> > > > > > with the commit reverted. The CPU frequency drops to minimum,
> > > > > > over about 10 or 15 seconds.,
> > > > >
> > > > > commit b50db7095fe0 essentially disables the clocksource watchdog,
> > > > > which literally doesn't have much to do with cpufreq code.
> > > > >
> > > > > One thing I can think of is, without the patch, there is a
> > > > > periodic clocksource timer running every 500 ms, and it loops to
> > > > > run on all CPUs in turn. For your HW, it has 12 CPUs (from the
> > > > > graph), so each CPU will get a timer (HW timer interrupt backed)
> > > > > every 6 seconds. Could this affect the cpufreq governor's work
> > > > > flow (I just quickly read some cpufreq code, and seem there is
> > > > > irq_work/workqueue involved).
> > > >
> > > > 6 Seconds is the longest duration I have ever seen on this processor
> > > > before commit b50db7095fe0.
> > > >
> > > > I said "the times between calls to the driver have never exceeded 10
> > > > seconds" originally, but that involved other processors.
> > > >
> > > > I also did longer, 9000 second tests:
> > > >
> > > > For a reverted kernel the driver was called 131,743, and 0 times the
> > > > duration was longer than 6.1 seconds.
> > > >
> > > > For a non-reverted kernel the driver was called 110,241 times, and
> > > > 1397 times the duration was longer than 6.1 seconds, and the maximum
> > > > duration was 303.6 seconds
> > >
> > > Thanks for the data, which shows it is related to the removal of
> > > clocksource watchdog timers. And under this specific configurations,
> > > the cpufreq work flow has some dependence on that watchdog timers.
> > >
> > > Also could you share you kernel config, boot message and some system
> > > settings like for tickless mode, so that other people can try to
> > > reproduce? thanks
> >
> > I steal the kernel configuration file from the Ubuntu mainline PPA [1], what
> > they call "lowlatency", or 1000Hz tick. I make these changes before compile:
> >
> > scripts/config --disable DEBUG_INFO
> > scripts/config --disable SYSTEM_TRUSTED_KEYS scripts/config --disable
> > SYSTEM_REVOCATION_KEYS
> >
> > I also send you the config and dmesg files in an off-list email.
> >
> > This is an idle, and very low periodic loads, system type test.
> > My test computer has no GUI and very few services running.
> > Notice that I have not used the word "regression" yet in this thread, because
> > I don't know for certain that it is. In the end, we don't care about CPU
> > frequency, we care about wasting energy.
> > It is definitely a change, and I am able to measure small increases in energy
> > use, but this is all at the low end of the power curve.

Please keep the above statement in mind: the differences were rather
minor. Since Rui asks for data below, I have tried to find better
examples.

> What do you use to measure the energy use? And what difference do you observe?

I use both turbostat and a processor power monitoring tool of my own.
Don't get me wrong, I love turbostat, but it has become somewhat heavy
(lots of code per interval) over recent years. My own utility has
minimal code per interval and only uses pre-allocated memory, with no
disk or terminal interaction during the sampling phase, resulting in
minimal system perturbation from the monitoring itself.
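
The core of the measurement is just reading a package energy counter
twice and dividing by the interval. A minimal sketch of the same idea
using the kernel powercap/RAPL sysfs interface (not my actual tool;
the sysfs path varies per system, it needs root, and counter wrap is
ignored):

E=/sys/class/powercap/intel-rapl:0/energy_uj
e1=$(cat "$E"); sleep 300; e2=$(cat "$E")
echo "pkg watts: $(echo "scale=2; ($e2 - $e1) / 300 / 1000000" | bc)"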

> > So far I have not found a significant example of increased power use, but I
> > also have not looked very hard.

I looked a little harder for this reply, searching single CPU loads
at a few work/sleep frequencies (73, 113, 211, 347, 401 hertz), with a
fixed work packet per work interval.

> > During any test, many monitoring tools might shorten durations.
> > For example if I run turbostat, say:
> >
> > sudo turbostat --Summary --quiet --show
> > Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --
> > interval
> > 2.5
> >
> > Well, yes then the maximum duration would be 2.5 seconds, because
> > turbostat wakes up each CPU to inquire about things causing a call to the CPU
> > scaling driver. (I tested this, for about
> > 900 seconds.)
> >
> > For my power tests I use a sample interval of >= 300 seconds.
>
> So you use something like "turbostat sleep 900" for power test,

Typically I use 300 seconds for turbostat for this work.
It was only the other day that I saw a duration of 302 seconds, so
yes, I should try even longer, but it is becoming a tradeoff between
experiments taking many, many hours and the time available.

> and the RAPL Energy counters show the power difference?

Yes.

> Can you paste the turbostat output both w/ and w/o the watchdog?

Workflow:

Task 1: Purpose: main periodic load.

411 hertz work/sleep frequency and about 26.6% load at 4.8 GHz,
max turbo (it would limit out at a 100% duty cycle at about pstate 13).

Note: this is a higher load than I was using when I started this
email thread.

Task 2: Purpose: To assist in promoting manifestation of the
issue of potential concern with commit b50db7095fe0.

A short burst of work (about 30 milliseconds @ max turbo,
longer at lower frequencies), then sleep for 45 seconds.
Say, almost zero load at 0.022 hertz.
Greater than or equal to 30 milliseconds at full load ensures
that the intel_pstate driver will be called at least 2 or 3 times,
raising the requested pstate for that CPU.
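
As an illustration only (not the actual program; the CPU number and
timings are examples), task 2 behaves roughly like:

while true; do
    taskset -c 10 timeout 0.03 sh -c 'while :; do :; done'  # ~30 ms busy
    sleep 45
done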

Task 3: Purpose: Acquire power data with minimum system
perturbation due to this very monitoring task.

Both turbostat and my own power monitor were used, never
at the same time (unless just to prove they reported the same
power). This turbostat command:

turbostat --Summary --quiet --show
Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval
300

and my program also sampled at 300 seconds per sample, typically
5 samples, starting 1/2 hour after boot.

Kernels:
stock: 5.17-rc3 as is.
revert: 5.17-rc3 with b50db7095fe0 reverted.

Test 1: no-hwp/active/powersave:

Doug's cpu_power_mon:

Stock: 5 samples @ 300 seconds per sample:
average: 4.7 watts +9%
minimum: 4.3 watts
maximum: 4.9 watts

Revert: 5 samples @ 300 seconds per sample:
average: 4.3 watts
minimum: 4.2 watts
maximum: 4.5 watts

Test 2: no-hwp/passive/schedutil:

In the beginning, active/powersave was used because it is the easiest
(at least for me) for understanding and interpreting the
intel_pstate_tracer.py results. Long durations are common in passive
mode, because something higher up can decide not to call the driver
when nothing has changed. That is very valuable lost information, in
my opinion.

I didn't figure it out for a while, but schedutil is bi-stable with
this workflow, depending on whether it approaches steady state from a
higher or lower previous load (i.e. hysteresis). With either kernel it
might run for hours and hours at an average of 6.1 watts or 4.9 watts.
This difference dominates, so trying to reveal whether there is any
contribution from the commit of this thread is useless.

Test 3: Similar workflow to test 1, but at 347 hertz and with a
little less work per work period.

Task 2 was changed to use more CPUs, to try to amplify manifestation
of the effect.

 Doug's cpu_power_mon:

Stock: 5 samples @ 300 seconds per sample:
average: 4.2 watts +31%
minimum: 3.5 watts
maximum: 4.9 watts

Revert: 5 samples @ 300 seconds per sample:
average: 3.2 watts
minimum: 3.1 watts
maximum: 3.2 watts

Re-do test 1, but with the improved task 2:

Turbostat:

Stock:

doug@s19:~$ sudo
/home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary
--quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt
--interval 300
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt CorWatt GFXWatt RAMWatt
4.09    3335    282024  36      5.79    5.11    0.00    0.89
4.11    3449    283286  34      6.06    5.40    0.00    0.89
4.11    3504    284532  35      6.35    5.70    0.00    0.89
4.11    3493    283908  35      6.26    5.60    0.00    0.89
4.11    3499    284243  34      6.27    5.62    0.00    0.89

Revert:

doug@s19:~/freq-scalers/long_dur$ sudo
/home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary
--quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt
--interval 300
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt CorWatt GFXWatt RAMWatt
4.12    3018    283114  34      4.57    3.91    0.00    0.89
4.12    3023    283482  34      4.59    3.94    0.00    0.89
4.12    3017    282609  34      4.71    4.05    0.00    0.89
4.12    3013    282898  34      4.64    3.99    0.00    0.89
4.12    3013    282956  36      4.56    3.90    0.00    0.89
4.12    3026    282807  34      4.61    3.95    0.00    0.89
4.12    3026    282738  35      4.50    3.84    0.00    0.89

Or an average of 6.2 watts versus 4.6 watts, +35%.

Several other tests had undetectable power differences
between kernels, but I did not go back and re-test with
the improved task 2.

>
> Thanks,
> rui
>
> > For duration only tests, turbostat is not run at the same time.
> >
> > My grub line:
> >
> > GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314
> > intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on
> > cpuidle.governor=teo"
> >
> > A typical pstate tracer command (with the script copied to the directory
> > where I run this stuff:):
> >
> > sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --memory
> > 800000
> >
> > >
> > > > > Can you try one test that keep all the current setting and change
> > > > > the irq affinity of disk/network-card to 0xfff to let interrupts
> > > > > from them be distributed to all CPUs?
> > > >
> > > > I am willing to do the test, but I do not know how to change the irq
> > > > affinity.
> > >
> > > I might say that too soon. I used to "echo fff > /proc/irq/xxx/smp_affinity"
> > > (xx is the irq number of a device) to let interrupts be distributed to
> > > all CPUs long time ago, but it doesn't work on my 2 desktops at hand.
> > > Seems it only support one-cpu irq affinity in recent kernel.
> > >
> > > You can still try that command, though it may not work.
> >
> > I did not try this yet.
> >
> > [1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-13 18:54           ` Doug Smythies
@ 2022-02-14 15:17             ` srinivas pandruvada
  2022-02-15 21:35               ` Doug Smythies
  2022-02-22  7:34               ` Feng Tang
  0 siblings, 2 replies; 47+ messages in thread
From: srinivas pandruvada @ 2022-02-14 15:17 UTC (permalink / raw)
  To: Doug Smythies, Zhang, Rui
  Cc: Tang, Feng, Thomas Gleixner, paulmck, stable, x86, linux-pm

Hi Doug,

I think you use CONFIG_NO_HZ_FULL.
Here we are getting callbacks from the scheduler. Can we check whether
the scheduler woke up on those CPUs?
We can run "trace-cmd -e sched" and check in KernelShark whether there
are similar gaps in activity.
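
For example, roughly (the output file name and the 20 minute capture
length are just suggestions):

sudo trace-cmd record -e sched -o sched.dat sleep 1200
kernelshark sched.dat                  # browse in the GUI
trace-cmd report -i sched.dat | less   # or as plain text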

Thanks,
Srinivas

On Sun, 2022-02-13 at 10:54 -0800, Doug Smythies wrote:
> Hi Rui,
> 
> On Wed, Feb 9, 2022 at 11:45 PM Zhang, Rui <rui.zhang@intel.com>
> wrote:
> > On 2022.02.09 14:23 Doug wrote:
> > > On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@intel.com>
> > > wrote:
> > > > On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
> > > > > > > 
> > > > > > > Since kernel 5.16-rc4 and commit:
> > > > > > > b50db7095fe002fa3e16605546cba66bf1b68a3e
> > > > > > > " x86/tsc: Disable clocksource watchdog for TSC on
> > > > > > > qualified platorms"
> > > > > > > 
> > > > > > > There are now occasions where times between calls to the
> > > > > > > driver
> > > > > > > can be over 100's of seconds and can result in the CPU
> > > > > > > frequency
> > > > > > > being left unnecessarily high for extended periods.
> > > > > > > 
> > > > > > > From the number of clock cycles executed between these
> > > > > > > long
> > > > > > > durations one can tell that the CPU has been running
> > > > > > > code, but
> > > > > > > the driver never got called.
> > > > > > > 
> > > > > > > Attached are some graphs from some trace data acquired
> > > > > > > using
> > > > > > > intel_pstate_tracer.py where one can observe an idle
> > > > > > > system
> > > > > > > between about 42 and well over 200 seconds elapsed time,
> > > > > > > yet
> > > > > > > CPU10 never gets called, which would have resulted in
> > > > > > > reducing
> > > > > > > it's pstate request, until an elapsed time of 167.616
> > > > > > > seconds,
> > > > > > > 126 seconds since the last call. The CPU frequency never
> > > > > > > does go to
> > > minimum.
> > > > > > > 
> > > > > > > For reference, a similar CPU frequency graph is also
> > > > > > > attached,
> > > > > > > with the commit reverted. The CPU frequency drops to
> > > > > > > minimum,
> > > > > > > over about 10 or 15 seconds.,
> > > > > > 
> > > > > > commit b50db7095fe0 essentially disables the clocksource
> > > > > > watchdog,
> > > > > > which literally doesn't have much to do with cpufreq code.
> > > > > > 
> > > > > > One thing I can think of is, without the patch, there is a
> > > > > > periodic clocksource timer running every 500 ms, and it
> > > > > > loops to
> > > > > > run on all CPUs in turn. For your HW, it has 12 CPUs (from
> > > > > > the
> > > > > > graph), so each CPU will get a timer (HW timer interrupt
> > > > > > backed)
> > > > > > every 6 seconds. Could this affect the cpufreq governor's
> > > > > > work
> > > > > > flow (I just quickly read some cpufreq code, and seem there
> > > > > > is
> > > > > > irq_work/workqueue involved).
> > > > > 
> > > > > 6 Seconds is the longest duration I have ever seen on this
> > > > > processor
> > > > > before commit b50db7095fe0.
> > > > > 
> > > > > I said "the times between calls to the driver have never
> > > > > exceeded 10
> > > > > seconds" originally, but that involved other processors.
> > > > > 
> > > > > I also did longer, 9000 second tests:
> > > > > 
> > > > > For a reverted kernel the driver was called 131,743, and 0
> > > > > times the
> > > > > duration was longer than 6.1 seconds.
> > > > > 
> > > > > For a non-reverted kernel the driver was called 110,241
> > > > > times, and
> > > > > 1397 times the duration was longer than 6.1 seconds, and the
> > > > > maximum
> > > > > duration was 303.6 seconds
> > > > 
> > > > Thanks for the data, which shows it is related to the removal
> > > > of
> > > > clocksource watchdog timers. And under this specific
> > > > configurations,
> > > > the cpufreq work flow has some dependence on that watchdog
> > > > timers.
> > > > 
> > > > Also could you share you kernel config, boot message and some
> > > > system
> > > > settings like for tickless mode, so that other people can try
> > > > to
> > > > reproduce? thanks
> > > 
> > > I steal the kernel configuration file from the Ubuntu mainline
> > > PPA [1], what
> > > they call "lowlatency", or 1000Hz tick. I make these changes
> > > before compile:
> > > 
> > > scripts/config --disable DEBUG_INFO
> > > scripts/config --disable SYSTEM_TRUSTED_KEYS scripts/config --
> > > disable
> > > SYSTEM_REVOCATION_KEYS
> > > 
> > > I also send you the config and dmesg files in an off-list email.
> > > 
> > > This is an idle, and very low periodic loads, system type test.
> > > My test computer has no GUI and very few services running.
> > > Notice that I have not used the word "regression" yet in this
> > > thread, because
> > > I don't know for certain that it is. In the end, we don't care
> > > about CPU
> > > frequency, we care about wasting energy.
> > > It is definitely a change, and I am able to measure small
> > > increases in energy
> > > use, but this is all at the low end of the power curve.
> 
> Please keep the above statement in mind. The differences were rather
> minor,
> Since Rui asks for data below, I have tried to find better examples.
> 
> > What do you use to measure the energy use? And what difference do
> > you observe?
> 
> I use both turbostat and a processor power monitoring tool of my own.
> Don't get me wrong, I love turbostat, but it has become somewhat
> heavy
> (lots of code per interval) over recent years. My own utility has
> minimal
> code per interval, only uses pre-allocated memory with no disk or
> terminal
> interaction during the sampling phase, resulting in minimal system
> perturbation
> due to itself.
> 
> > > So far I have not found a significant example of increased power
> > > use, but I
> > > also have not looked very hard.
> 
> I looked a little harder for this reply, searching all single CPU
> loads for a few
> work/sleep frequencies (73, 113, 211, 347, 401 hertz) fixed work
> packet
> per work interval.
> 
> > > During any test, many monitoring tools might shorten durations.
> > > For example if I run turbostat, say:
> > > 
> > > sudo turbostat --Summary --quiet --show
> > > Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --
> > > interval
> > > 2.5
> > > 
> > > Well, yes then the maximum duration would be 2.5 seconds, because
> > > turbostat wakes up each CPU to inquire about things causing a
> > > call to the CPU
> > > scaling driver. (I tested this, for about
> > > 900 seconds.)
> > > 
> > > For my power tests I use a sample interval of >= 300 seconds.
> > 
> > So you use something like "turbostat sleep 900" for power test,
> 
> Typically I use 300 seconds for turbostat for this work.
> It was only the other day, that I saw a duration of 302 seconds, so
> yes I should try even longer, but it is becoming a tradeoff between
> experiments taking many many hours and time available.
> 
> > and the RAPL Energy counters show the power difference?
> 
> Yes.
> 
> > Can you paste the turbostat output both w/ and w/o the watchdog?
> 
> Workflow:
> 
> Task 1: Purpose: main periodic load.
> 
> 411 hertz work sleep frequency and about 26.6% load at 4.8 Ghz,
> max_turbo (it would limit out at 100% duty cycle at about pstate 13)
> 
> Note: this is a higher load than I was using when I started this
> email thread.
> 
> Task 2: Purpose: To assist in promoting manifestation of the
> issue of potential concern with commit b50db7095fe0.
> 
> A short burst of work (about 30 milliseconds @ max turbo,
> longer at lower frequencies), then sleep for 45 seconds.
> Say, almost 0 load at 0.022 hertz.
> Greater than or equal to 30 milliseconds at full load, ensures
> that the intel_pstate driver will be called at least 2 or 3 times,
> raising the requested pstate for that CPU.
> 
> Task 3: Purpose: Acquire power data with minimum system
> perturbation due to this very monitoring task.
> 
> Both turbostat and my own power monitor were used, never
> at the same time (unless just to prove they reported the same
> power). This turbostat command:
> 
> turbostat --Summary --quiet --show
> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval
> 300
> 
> and my program also sampled at 300 second per sample,
> typically 5 samples, and after 1/2 hour since boot.
> 
> Kernels:
> stock: 5.17-rc3 as is.
> revert: 5.17-rc3 with b50db7095fe0 reverted.
> 
> Test 1: no-hwp/active/powersave:
> 
> Doug's cpu_power_mon:
> 
> Stock: 5 samples @ 300 seconds per sample:
> average: 4.7 watts +9%
> minimum: 4.3 watts
> maximum: 4.9 watts
> 
> Revert: 5 samples @ 300 seconds per sample:
> average: 4.3 watts
> minimum: 4.2 watts
> maximum: 4.5 watts
> 
> Test 2: no-hwp/passive/schedutil:
> 
> In the beginning active/powersave was used
> because it is the easiest (at least for me) to
> understand and interpret the intel_pstate_tracer.py
> results. Long durations are common in passive mode,
> because something higher up can decide not to call the driver
> because nothing changed. Very valuable lost information
> in my opinion.
> 
> I didn't figure it out for awhile, but schedutil is bi-stable
> with this workflow, depending on if it approaches steady
> state from a higher or lower previous load (i.e. hysteresis).
> With either kernel it might run for hours and hours at an
> average of 6.1 watts or 4.9 watts. This difference dominates,
> so trying to reveal if there is any contribution from the commit
> of this thread is useless.
> 
> Test 3: Similar workflow as test1, but 347 Hertz and
> a little less work per work period.
> 
> Task 2 was changed to use more CPUs to try
> to potentially amplify manifestation of the effect.
> 
>  Doug's cpu_power_mon:
> 
> Stock: 5 samples @ 300 seconds per sample:
> average: 4.2 watts +31%
> minimum: 3.5 watts
> maximum: 4.9 watts
> 
> Revert: 5 samples @ 300 seconds per sample:
> average: 3.2 watts
> minimum: 3.1 watts
> maximum: 3.2 watts
> 
> Re-do test 1, but with the improved task 2:
> 
> Turbostat:
> 
> Stock:
> 
> doug@s19:~$ sudo
> /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary
> --quiet --show
> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt
> --interval 300
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt CorWatt GFXWatt RAMWatt
> 4.09    3335    282024  36      5.79    5.11    0.00    0.89
> 4.11    3449    283286  34      6.06    5.40    0.00    0.89
> 4.11    3504    284532  35      6.35    5.70    0.00    0.89
> 4.11    3493    283908  35      6.26    5.60    0.00    0.89
> 4.11    3499    284243  34      6.27    5.62    0.00    0.89
> 
> Revert:
> 
> doug@s19:~/freq-scalers/long_dur$ sudo
> /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary
> --quiet --show
> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt
> --interval 300
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt CorWatt GFXWatt RAMWatt
> 4.12    3018    283114  34      4.57    3.91    0.00    0.89
> 4.12    3023    283482  34      4.59    3.94    0.00    0.89
> 4.12    3017    282609  34      4.71    4.05    0.00    0.89
> 4.12    3013    282898  34      4.64    3.99    0.00    0.89
> 4.12    3013    282956  36      4.56    3.90    0.00    0.89
> 4.12    3026    282807  34      4.61    3.95    0.00    0.89
> 4.12    3026    282738  35      4.50    3.84    0.00    0.89
> 
> Or average of 6.2 watts versus 4.6 watts, +35%.
> 
> Several other tests had undetectable power differences
> between kernels, but I did not go back and re-test with
> the improved task 2.
> 
> > 
> > Thanks,
> > rui
> > 
> > > For duration only tests, turbostat is not run at the same time.
> > > 
> > > My grub line:
> > > 
> > > GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314
> > > intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on
> > > cpuidle.governor=teo"
> > > 
> > > A typical pstate tracer command (with the script copied to the
> > > directory
> > > where I run this stuff:):
> > > 
> > > sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --
> > > memory
> > > 800000
> > > 
> > > > 
> > > > > > Can you try one test that keep all the current setting and
> > > > > > change
> > > > > > the irq affinity of disk/network-card to 0xfff to let
> > > > > > interrupts
> > > > > > from them be distributed to all CPUs?
> > > > > 
> > > > > I am willing to do the test, but I do not know how to change
> > > > > the irq
> > > > > affinity.
> > > > 
> > > > I might say that too soon. I used to "echo fff >
> > > > /proc/irq/xxx/smp_affinity"
> > > > (xx is the irq number of a device) to let interrupts be
> > > > distributed to
> > > > all CPUs long time ago, but it doesn't work on my 2 desktops at
> > > > hand.
> > > > Seems it only support one-cpu irq affinity in recent kernel.
> > > > 
> > > > You can still try that command, though it may not work.
> > > 
> > > I did not try this yet.
> > > 
> > > [1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-14 15:17             ` srinivas pandruvada
@ 2022-02-15 21:35               ` Doug Smythies
  2022-02-22  7:34               ` Feng Tang
  1 sibling, 0 replies; 47+ messages in thread
From: Doug Smythies @ 2022-02-15 21:35 UTC (permalink / raw)
  To: srinivas pandruvada
  Cc: Zhang, Rui, Tang, Feng, Thomas Gleixner, paulmck, stable, x86,
	linux-pm, dsmythies

Hi Srinivas,

On Mon, Feb 14, 2022 at 7:17 AM srinivas pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> Hi Doug,
>
> I think you use CONFIG_NO_HZ_FULL.

No. Here is the relevant excerpt from the kernel config file:

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

> Here we are getting callback from scheduler. Can we check that if
> scheduler woke up on those CPUs?

As far as I can determine, yes.
But note that I am unfamiliar with this area.

> We can run "trace-cmd -e sched" and check in kernel shark if there is
> similar gaps in activity.

I do not use trace-cmd and had never heard of KernelShark.
Nor do I actually run any desktop GUI on Linux, only servers.
I attempted to acquire what you wanted with primitive trace
commands.
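
Something along these lines, via the raw ftrace interface (only a
sketch; run as root, paths assume debugfs mounted at
/sys/kernel/debug, and the buffer size and CPU number are examples):

cd /sys/kernel/debug/tracing
echo 800000 > buffer_size_kb                # large per-CPU buffer
echo 1 > events/sched/sched_wakeup/enable
echo 1 > tracing_on
sleep 1200                                  # 20 minute capture
echo 0 > tracing_on
cat per_cpu/cpu2/trace > /tmp/cpu2.txt      # one file per CPU of interest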

Workflow: all as before (the 347 hertz work/sleep frequency test),
with a 20 minute trace in the middle. Powers were monitored again to
confirm the differences, and just in case the trace itself modified
the system response (it didn't).

Power averages (excluding the sample where the trace file was being
written to disk):
Stock: 4.1 watts (+37%)
Revert: 3.0 watts

I only looked at the data for a few of the CPUs (the largest,
smallest and a mid-range file size), excluding the main working CPU.
Maximum times between "sched_wakeup" events, in seconds:

Stock:
CPU 2: 4.0
CPU 4: 4.0
CPU 9: 1.0

Revert:
CPU 1: 2.0
CPU 2: 4.0
CPU 7: 1.54

I do not know if other stuff in the files might be odd or not.

... Doug

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-14 15:17             ` srinivas pandruvada
  2022-02-15 21:35               ` Doug Smythies
@ 2022-02-22  7:34               ` Feng Tang
  2022-02-22 18:04                 ` Rafael J. Wysocki
  1 sibling, 1 reply; 47+ messages in thread
From: Feng Tang @ 2022-02-22  7:34 UTC (permalink / raw)
  To: srinivas pandruvada
  Cc: Doug Smythies, Zhang, Rui, Thomas Gleixner, paulmck, stable, x86,
	linux-pm

On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
> Hi Doug,
> 
> I think you use CONFIG_NO_HZ_FULL.
> Here we are getting callback from scheduler. Can we check that if
> scheduler woke up on those CPUs?
> We can run "trace-cmd -e sched" and check in kernel shark if there is
> similar gaps in activity.

Srinivas analyzed the scheduler trace data from trace-cmd, and thinks
it is related to the cpufreq callback not being called in a timely
manner from scheduling events:

" 
I mean we ignore the callback when the target CPU is not the local
CPU, as we have to do an IPI to adjust MSRs.
This will happen many times when sched_wake wakes up a new CPU for the
thread (we will get a callback for the target), but once the remote
thread starts executing ("sched_switch"), we will get a callback on
the local CPU, so we will adjust frequencies (provided a 10 ms
interval from the last call).

From the trace file I see a scenario where it took 72 seconds between
two updates:
CPU 2
34412.597161    busy=78         freq=3232653
34484.450725    busy=63         freq=2606793

There is periodic activity in between, related to active load
balancing in the scheduler (since the last frequency was higher, these
small bits of work also run at the higher frequency). But those
threads are not in the CFS class, so the scheduler callback will not
be called for them.

So the patch removed a trigger which would have caused a sched_switch
to a CFS task and a call to the cpufreq/intel_pstate callback.
But calling it for every class would be too many callbacks, and I am
not sure we can even call it for the "stop" class, which these
migration threads are using.
"

Following this direction, I made a hacky debug patch which should
help to restore the previous behavior.

Doug, could you help to try it? Thanks.

It basically tries to make sure the cpufreq update_util callback is
called in a timely manner, even on a silent system with very few
interrupts (even from the tick).

Thanks,
Feng

From 6be5f5da66a847860b0b9924fbb09f93b2e2d6e6 Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Tue, 22 Feb 2022 22:59:00 +0800
Subject: [PATCH] idle/intel-pstate: hacky debug patch to make sure the
 cpufreq_update_util callback being called timely in silent system

---
 kernel/sched/idle.c  | 10 ++++++++++
 kernel/sched/sched.h | 13 +++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..cc538acb3f1a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
  *
  * Called with polling cleared.
  */
+DEFINE_PER_CPU(u64, last_util_update_time);	/* in jiffies */
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	u64 expire;
 
 	/*
 	 * Check if we need to update blocked load
 	 */
 	nohz_run_idle_balance(cpu);
 
+#ifdef CONFIG_X86_INTEL_PSTATE
+	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
+	if (unlikely(time_is_before_jiffies(expire))) {
+		idle_update_util();
+		__this_cpu_write(last_util_update_time, get_jiffies_64());
+	}
+#endif
+
 	/*
 	 * If the arch has a polling bit, we maintain an invariant:
 	 *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0e66749486e7..2a8d87988d1f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2809,6 +2809,19 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
 	if (data)
 		data->func(data, rq_clock(rq), flags);
 }
+
+static inline void idle_update_util(void)
+{
+	struct update_util_data *data;
+	struct rq *rq = cpu_rq(raw_smp_processor_id());
+
+	data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
+						  cpu_of(rq)));
+	if (data)
+		data->func(data, rq_clock(rq), 0);
+}
+
+
 #else
 static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #endif /* CONFIG_CPU_FREQ */
-- 
2.25.1





^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-22  7:34               ` Feng Tang
@ 2022-02-22 18:04                 ` Rafael J. Wysocki
  2022-02-23  0:07                   ` Doug Smythies
                                     ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2022-02-22 18:04 UTC (permalink / raw)
  To: Feng Tang
  Cc: srinivas pandruvada, Doug Smythies, Zhang, Rui, Thomas Gleixner,
	paulmck, stable, x86, linux-pm

On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@intel.com> wrote:
>
> On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
> > Hi Doug,
> >
> > I think you use CONFIG_NO_HZ_FULL.
> > Here we are getting callback from scheduler. Can we check that if
> > scheduler woke up on those CPUs?
> > We can run "trace-cmd -e sched" and check in kernel shark if there is
> > similar gaps in activity.
>
> Srinivas analyzed the scheduler trace data from trace-cmd, and thought is
> related with the cpufreq callback is not called timeley from scheduling
> events:
>
> "
> I mean we ignore the callback when the target CPU is not a local CPU as
> we have to do IPI to adjust MSRs.
> This will happen many times when sched_wake will wake up a new CPU for
> the thread (we will get a callack for the target) but once the remote
> thread start executing "sched_switch", we will get a callback on local
> CPU, so we will adjust frequencies (provided 10ms interval from the
> last call).
>
> >From the trace file I see the scenario where it took 72sec between two
> updates:
> CPU 2
> 34412.597161    busy=78         freq=3232653
> 34484.450725    busy=63         freq=2606793
>
> There is periodic activity in between, related to active load balancing
> in scheduler (since last frequency was higher these small work will
> also run at higher frequency). But those threads are not CFS class, so
> scheduler callback will not be called for them.
>
> So removing the patch removed a trigger which would have caused a
> sched_switch to a CFS task and call a cpufreq/intel_pstate callback.

And so this behavior needs to be restored for the time being, which
means reverting the problematic commit for 5.17 if possible.

It is unlikely that we'll get a proper fix before -rc7, and we still
need to test it properly.

> But calling for every class, will be too many callbacks and not sure we
> can even call for "stop" class, which these migration threads are
> using.
> "

Calling it for RT/deadline may not be a bad idea.

schedutil takes these classes into account when computing the
utilization now (see effective_cpu_util()), so doing callbacks only
for CFS seems insufficient.

Another way to avoid the issue at hand may be to prevent entering deep
idle via PM QoS if the CPUs are running at high frequencies.
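
A very rough sketch of that PM QoS idea, purely for illustration (the
threshold value is a made-up placeholder and the helper placement is not
worked out; only the cpu_latency_qos_* calls themselves are real):

#include <linux/pm_qos.h>

static struct pm_qos_request high_pstate_qos;

static void maybe_block_deep_idle(bool pstate_is_high)
{
	/*
	 * Note: a CPU latency QoS request is a global constraint, which is
	 * part of why picking a value here is not obvious.
	 */
	if (pstate_is_high && !cpu_latency_qos_request_active(&high_pstate_qos))
		cpu_latency_qos_add_request(&high_pstate_qos, 20 /* usec, made up */);
	else if (!pstate_is_high && cpu_latency_qos_request_active(&high_pstate_qos))
		cpu_latency_qos_remove_request(&high_pstate_qos);
}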

> Following this direction, I made a hacky debug patch which should help
> to restore the previous behavior.
>
> Doug, could you help to try it? thanks
>
> It basically tries to make sure the cpufreq-update-util be called timely
> even for a silent system with very few interrupts (even from tick).
>
> Thanks,
> Feng
>
> From 6be5f5da66a847860b0b9924fbb09f93b2e2d6e6 Mon Sep 17 00:00:00 2001
> From: Feng Tang <feng.tang@intel.com>
> Date: Tue, 22 Feb 2022 22:59:00 +0800
> Subject: [PATCH] idle/intel-pstate: hacky debug patch to make sure the
>  cpufreq_update_util callback being called timely in silent system
>
> ---
>  kernel/sched/idle.c  | 10 ++++++++++
>  kernel/sched/sched.h | 13 +++++++++++++
>  2 files changed, 23 insertions(+)
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index d17b0a5ce6ac..cc538acb3f1a 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
>   *
>   * Called with polling cleared.
>   */
> +DEFINE_PER_CPU(u64, last_util_update_time);    /* in jiffies */
>  static void do_idle(void)
>  {
>         int cpu = smp_processor_id();
> +       u64 expire;
>
>         /*
>          * Check if we need to update blocked load
>          */
>         nohz_run_idle_balance(cpu);
>
> +#ifdef CONFIG_X86_INTEL_PSTATE

Why?  Doesn't this affect the other cpufreq governors?

> +       expire = __this_cpu_read(last_util_update_time) + HZ * 3;
> +       if (unlikely(time_is_before_jiffies(expire))) {
> +               idle_update_util();
> +               __this_cpu_write(last_util_update_time, get_jiffies_64());
> +       }
> +#endif
> +
>         /*
>          * If the arch has a polling bit, we maintain an invariant:
>          *
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 0e66749486e7..2a8d87988d1f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2809,6 +2809,19 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
>         if (data)
>                 data->func(data, rq_clock(rq), flags);
>  }
> +
> +static inline void idle_update_util(void)
> +{
> +       struct update_util_data *data;
> +       struct rq *rq = cpu_rq(raw_smp_processor_id());
> +
> +       data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
> +                                                 cpu_of(rq)));
> +       if (data)
> +               data->func(data, rq_clock(rq), 0);
> +}
> +
> +
>  #else
>  static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>  #endif /* CONFIG_CPU_FREQ */
> --

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-22 18:04                 ` Rafael J. Wysocki
@ 2022-02-23  0:07                   ` Doug Smythies
  2022-02-23  0:32                     ` srinivas pandruvada
  2022-02-23  2:49                   ` Feng Tang
  2022-02-23  9:40                   ` Thomas Gleixner
  2 siblings, 1 reply; 47+ messages in thread
From: Doug Smythies @ 2022-02-23  0:07 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Feng Tang, srinivas pandruvada, Zhang, Rui, Thomas Gleixner,
	paulmck, stable, x86, linux-pm, dsmythies

Hi All,

I am about 1/2 way through testing Feng's "hacky debug patch",
let me know if I am wasting my time, and I'll abort. So far, it
works fine.

The compiler complains:

kernel/sched/idle.c: In function ‘do_idle’:
./include/linux/typecheck.h:12:18: warning: comparison of distinct
pointer types lacks a cast
   12 |  (void)(&__dummy == &__dummy2); \
      |                  ^~
./include/linux/compiler.h:78:42: note: in definition of macro ‘unlikely’
   78 | # define unlikely(x) __builtin_expect(!!(x), 0)
      |                                          ^
./include/linux/jiffies.h:106:3: note: in expansion of macro ‘typecheck’
  106 |   typecheck(unsigned long, b) && \
      |   ^~~~~~~~~
./include/linux/jiffies.h:154:35: note: in expansion of macro ‘time_after’
  154 | #define time_is_before_jiffies(a) time_after(jiffies, a)
      |                                   ^~~~~~~~~~
kernel/sched/idle.c:274:15: note: in expansion of macro ‘time_is_before_jiffies’
  274 |  if (unlikely(time_is_before_jiffies(expire))) {
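
If I read the warning right, it comes from comparing the u64 "expire"
against "jiffies" (an unsigned long) inside time_is_before_jiffies().
A possible way to silence it (just a guess on my part, untested) would
be to keep the per-CPU timestamp in unsigned long jiffies, e.g.:

-DEFINE_PER_CPU(u64, last_util_update_time);	/* in jiffies */
+DEFINE_PER_CPU(unsigned long, last_util_update_time);	/* in jiffies */
 ...
-	u64 expire;
+	unsigned long expire;
 ...
-		__this_cpu_write(last_util_update_time, get_jiffies_64());
+		__this_cpu_write(last_util_update_time, jiffies);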

Test 1: a 347 hertz work/sleep cycle on one CPU while the others carry
virtually no load, around 0.02 hertz aggregate.
Processor package power (Watts):

Kernel 5.17-rc3 + Feng patch (6 samples at 300 sec per):
Average: 3.2
Min: 3.1
Max: 3.3

Kernel 5.17-rc3 (Stock) re-stated:
> Stock: 5 samples @ 300 seconds per sample:
> average: 4.2 watts +31%
> minimum: 3.5 watts
> maximum: 4.9 watts

Kernel 5.17-rc3 with with b50db7095fe0 reverted. (Revert) re-stated:
> Revert: 5 samples @ 300 seconds per sample:
> average: 3.2 watts
> minimum: 3.1 watts
> maximum: 3.2 watts

I also ran intel_pstate_tracer.py for 5 hours 33 minutes
(20,000 seconds) on an idle system looking for long durations
(just being thorough).  There were 12 occurrences of durations
longer than 6.1 seconds.
Max: 6.8 seconds. (O.K.)
I had expected about 3 seconds max, based on my
understanding of the patch code.

Old results restated, but corrected for a stupid mistake:

Stock:
1712 occurrences of durations longer than 6.1 seconds
Max: 303.6 seconds. (Bad)

Revert:
3 occurrences of durations longer than 6.1 seconds
Max: 6.5 seconds (O.K.)

On Tue, Feb 22, 2022 at 10:04 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
> > > Hi Doug,
> > >
> > > I think you use CONFIG_NO_HZ_FULL.
> > > Here we are getting callback from scheduler. Can we check that if
> > > scheduler woke up on those CPUs?
> > > We can run "trace-cmd -e sched" and check in kernel shark if there is
> > > similar gaps in activity.
> >
> > Srinivas analyzed the scheduler trace data from trace-cmd, and thought is
> > related with the cpufreq callback is not called timeley from scheduling
> > events:
> >
> > "
> > I mean we ignore the callback when the target CPU is not a local CPU as
> > we have to do IPI to adjust MSRs.
> > This will happen many times when sched_wake will wake up a new CPU for
> > the thread (we will get a callack for the target) but once the remote
> > thread start executing "sched_switch", we will get a callback on local
> > CPU, so we will adjust frequencies (provided 10ms interval from the
> > last call).
> >
> > >From the trace file I see the scenario where it took 72sec between two
> > updates:
> > CPU 2
> > 34412.597161    busy=78         freq=3232653
> > 34484.450725    busy=63         freq=2606793
> >
> > There is periodic activity in between, related to active load balancing
> > in scheduler (since last frequency was higher these small work will
> > also run at higher frequency). But those threads are not CFS class, so
> > scheduler callback will not be called for them.
> >
> > So removing the patch removed a trigger which would have caused a
> > sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
>
> And so this behavior needs to be restored for the time being which
> means reverting the problematic commit for 5.17 if possible.
>
> It is unlikely that we'll get a proper fix before -rc7 and we still
> need to test it properly.
>
> > But calling for every class, will be too many callbacks and not sure we
> > can even call for "stop" class, which these migration threads are
> > using.
> > "
>
> Calling it for RT/deadline may not be a bad idea.
>
> schedutil takes these classes into account when computing the
> utilization now (see effective_cpu_util()), so doing callbacks only
> for CFS seems insufficient.
>
> Another way to avoid the issue at hand may be to prevent entering deep
> idle via PM QoS if the CPUs are running at high frequencies.
>
> > Following this direction, I made a hacky debug patch which should help
> > to restore the previous behavior.
> >
> > Doug, could you help to try it? thanks
> >
> > It basically tries to make sure the cpufreq-update-util be called timely
> > even for a silent system with very few interrupts (even from tick).
> >
> > Thanks,
> > Feng
> >
> > From 6be5f5da66a847860b0b9924fbb09f93b2e2d6e6 Mon Sep 17 00:00:00 2001
> > From: Feng Tang <feng.tang@intel.com>
> > Date: Tue, 22 Feb 2022 22:59:00 +0800
> > Subject: [PATCH] idle/intel-pstate: hacky debug patch to make sure the
> >  cpufreq_update_util callback being called timely in silent system
> >
> > ---
> >  kernel/sched/idle.c  | 10 ++++++++++
> >  kernel/sched/sched.h | 13 +++++++++++++
> >  2 files changed, 23 insertions(+)
> >
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index d17b0a5ce6ac..cc538acb3f1a 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
> >   *
> >   * Called with polling cleared.
> >   */
> > +DEFINE_PER_CPU(u64, last_util_update_time);    /* in jiffies */
> >  static void do_idle(void)
> >  {
> >         int cpu = smp_processor_id();
> > +       u64 expire;
> >
> >         /*
> >          * Check if we need to update blocked load
> >          */
> >         nohz_run_idle_balance(cpu);
> >
> > +#ifdef CONFIG_X86_INTEL_PSTATE
>
> Why?  Doesn't this affect the other ccpufreq governors?

Yes, I have the same question.

>
> > +       expire = __this_cpu_read(last_util_update_time) + HZ * 3;
> > +       if (unlikely(time_is_before_jiffies(expire))) {
> > +               idle_update_util();
> > +               __this_cpu_write(last_util_update_time, get_jiffies_64());
> > +       }
> > +#endif
> > +
> >         /*
> >          * If the arch has a polling bit, we maintain an invariant:
> >          *
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 0e66749486e7..2a8d87988d1f 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2809,6 +2809,19 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
> >         if (data)
> >                 data->func(data, rq_clock(rq), flags);
> >  }
> > +
> > +static inline void idle_update_util(void)
> > +{
> > +       struct update_util_data *data;
> > +       struct rq *rq = cpu_rq(raw_smp_processor_id());
> > +
> > +       data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
> > +                                                 cpu_of(rq)));
> > +       if (data)
> > +               data->func(data, rq_clock(rq), 0);
> > +}
> > +
> > +
> >  #else
> >  static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> >  #endif /* CONFIG_CPU_FREQ */
> > --

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-23  0:07                   ` Doug Smythies
@ 2022-02-23  0:32                     ` srinivas pandruvada
  2022-02-23  0:40                       ` Feng Tang
  0 siblings, 1 reply; 47+ messages in thread
From: srinivas pandruvada @ 2022-02-23  0:32 UTC (permalink / raw)
  To: Doug Smythies, Rafael J. Wysocki
  Cc: Feng Tang, Zhang, Rui, Thomas Gleixner, paulmck, stable, x86, linux-pm

Hi Doug,

On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
> Hi All,
> 
> I am about 1/2 way through testing Feng's "hacky debug patch",
> let me know if I am wasting my time, and I'll abort. So far, it
> works fine.
This just proves that if you add some callback during a long idle period,
you will reach a less aggressive P-state. I think you already proved that
with your results below showing 1 W less average power ("Kernel 5.17-rc3
+ Feng patch (6 samples at 300 sec per)").

Rafael replied with one possible option. Alternatively, when planning to
enter deep idle, set the P-state to the minimum with a callback, like we
do in the offline callback.
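
Very roughly, such a hook might look like the sketch below (hypothetical
only: the function name is made up, and only intel_pstate_set_pstate(),
all_cpu_data and pstate.min_pstate follow the existing driver structure):

static void intel_pstate_enter_deep_idle(unsigned int cpunum)
{
	struct cpudata *cpu = all_cpu_data[cpunum];

	/*
	 * Request the lowest P-state so an idle CPU is not parked at the
	 * high frequency left over from the last busy period.
	 */
	intel_pstate_set_pstate(cpu, cpu->pstate.min_pstate);
}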

So we need to think about a proper solution for this.

Thanks,
Srinivas

> 
> The compiler complains:
> 
> kernel/sched/idle.c: In function ‘do_idle’:
> ./include/linux/typecheck.h:12:18: warning: comparison of distinct
> pointer types lacks a cast
>    12 |  (void)(&__dummy == &__dummy2); \
>       |                  ^~
> ./include/linux/compiler.h:78:42: note: in definition of macro
> ‘unlikely’
>    78 | # define unlikely(x) __builtin_expect(!!(x), 0)
>       |                                          ^
> ./include/linux/jiffies.h:106:3: note: in expansion of macro
> ‘typecheck’
>   106 |   typecheck(unsigned long, b) && \
>       |   ^~~~~~~~~
> ./include/linux/jiffies.h:154:35: note: in expansion of macro
> ‘time_after’
>   154 | #define time_is_before_jiffies(a) time_after(jiffies, a)
>       |                                   ^~~~~~~~~~
> kernel/sched/idle.c:274:15: note: in expansion of macro
> ‘time_is_before_jiffies’
>   274 |  if (unlikely(time_is_before_jiffies(expire))) {
> 
> Test 1: 347 Hertz work/sleep frequency on one CPU while others do
> virtually no load, but at around 0.02 hertz aggregate.
> Processor package power (Watts):
> 
> Kernel 5.17-rc3 + Feng patch (6 samples at 300 sec per):
> Average: 3.2
> Min: 3.1
> Max: 3.3
> 
> Kernel 5.17-rc3 (Stock) re-stated:
> > Stock: 5 samples @ 300 seconds per sample:
> > average: 4.2 watts +31%
> > minimum: 3.5 watts
> > maximum: 4.9 watts
> 
> Kernel 5.17-rc3 with with b50db7095fe0 reverted. (Revert) re-stated:
> > Revert: 5 samples @ 300 seconds per sample:
> > average: 3.2 watts
> > minimum: 3.1 watts
> > maximum: 3.2 watts
> 
> I also ran intel_pstate_tracer.py for 5 hours 33 minutes
> (20,000 seconds) on an idle system looking for long durations.
> (just being thorough.) There were 12 occurrences of durations
> longer than 6.1 seconds.
> Max: 6.8 seconds. (O.K.)
> I had expected about 3 seconds max, based on my
> understanding of the patch code.
> 
> Old results restated, but corrected for a stupid mistake:
> 
> Stock:
> 1712 occurrences of durations longer than 6.1 seconds
> Max: 303.6 seconds. (Bad)
> 
> Revert:
> 3 occurrences of durations longer than 6.1 seconds
> Max: 6.5 seconds (O.K.)
> 
> On Tue, Feb 22, 2022 at 10:04 AM Rafael J. Wysocki
> <rafael@kernel.org> wrote:
> > 
> > On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@intel.com>
> > wrote:
> > > 
> > > On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada
> > > wrote:
> > > > Hi Doug,
> > > > 
> > > > I think you use CONFIG_NO_HZ_FULL.
> > > > Here we are getting callback from scheduler. Can we check that
> > > > if
> > > > scheduler woke up on those CPUs?
> > > > We can run "trace-cmd -e sched" and check in kernel shark if
> > > > there is
> > > > similar gaps in activity.
> > > 
> > > Srinivas analyzed the scheduler trace data from trace-cmd, and
> > > thought is
> > > related with the cpufreq callback is not called timeley from
> > > scheduling
> > > events:
> > > 
> > > "
> > > I mean we ignore the callback when the target CPU is not a local
> > > CPU as
> > > we have to do IPI to adjust MSRs.
> > > This will happen many times when sched_wake will wake up a new
> > > CPU for
> > > the thread (we will get a callack for the target) but once the
> > > remote
> > > thread start executing "sched_switch", we will get a callback on
> > > local
> > > CPU, so we will adjust frequencies (provided 10ms interval from
> > > the
> > > last call).
> > > 
> > > > From the trace file I see the scenario where it took 72sec
> > > > between two
> > > updates:
> > > CPU 2
> > > 34412.597161    busy=78         freq=3232653
> > > 34484.450725    busy=63         freq=2606793
> > > 
> > > There is periodic activity in between, related to active load
> > > balancing
> > > in scheduler (since last frequency was higher these small work
> > > will
> > > also run at higher frequency). But those threads are not CFS
> > > class, so
> > > scheduler callback will not be called for them.
> > > 
> > > So removing the patch removed a trigger which would have caused a
> > > sched_switch to a CFS task and call a cpufreq/intel_pstate
> > > callback.
> > 
> > And so this behavior needs to be restored for the time being which
> > means reverting the problematic commit for 5.17 if possible.
> > 
> > It is unlikely that we'll get a proper fix before -rc7 and we still
> > need to test it properly.
> > 
> > > But calling for every class, will be too many callbacks and not
> > > sure we
> > > can even call for "stop" class, which these migration threads are
> > > using.
> > > "
> > 
> > Calling it for RT/deadline may not be a bad idea.
> > 
> > schedutil takes these classes into account when computing the
> > utilization now (see effective_cpu_util()), so doing callbacks only
> > for CFS seems insufficient.
> > 
> > Another way to avoid the issue at hand may be to prevent entering
> > deep
> > idle via PM QoS if the CPUs are running at high frequencies.
> > 
> > > Following this direction, I made a hacky debug patch which should
> > > help
> > > to restore the previous behavior.
> > > 
> > > Doug, could you help to try it? thanks
> > > 
> > > It basically tries to make sure the cpufreq-update-util be called
> > > timely
> > > even for a silent system with very few interrupts (even from
> > > tick).
> > > 
> > > Thanks,
> > > Feng
> > > 
> > > From 6be5f5da66a847860b0b9924fbb09f93b2e2d6e6 Mon Sep 17 00:00:00
> > > 2001
> > > From: Feng Tang <feng.tang@intel.com>
> > > Date: Tue, 22 Feb 2022 22:59:00 +0800
> > > Subject: [PATCH] idle/intel-pstate: hacky debug patch to make
> > > sure the
> > >  cpufreq_update_util callback being called timely in silent
> > > system
> > > 
> > > ---
> > >  kernel/sched/idle.c  | 10 ++++++++++
> > >  kernel/sched/sched.h | 13 +++++++++++++
> > >  2 files changed, 23 insertions(+)
> > > 
> > > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > > index d17b0a5ce6ac..cc538acb3f1a 100644
> > > --- a/kernel/sched/idle.c
> > > +++ b/kernel/sched/idle.c
> > > @@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
> > >   *
> > >   * Called with polling cleared.
> > >   */
> > > +DEFINE_PER_CPU(u64, last_util_update_time);    /* in jiffies */
> > >  static void do_idle(void)
> > >  {
> > >         int cpu = smp_processor_id();
> > > +       u64 expire;
> > > 
> > >         /*
> > >          * Check if we need to update blocked load
> > >          */
> > >         nohz_run_idle_balance(cpu);
> > > 
> > > +#ifdef CONFIG_X86_INTEL_PSTATE
> > 
> > Why?  Doesn't this affect the other ccpufreq governors?
> 
> Yes, I have the same question.
> 
> > 
> > > +       expire = __this_cpu_read(last_util_update_time) + HZ * 3;
> > > +       if (unlikely(time_is_before_jiffies(expire))) {
> > > +               idle_update_util();
> > > +               __this_cpu_write(last_util_update_time,
> > > get_jiffies_64());
> > > +       }
> > > +#endif
> > > +
> > >         /*
> > >          * If the arch has a polling bit, we maintain an
> > > invariant:
> > >          *
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 0e66749486e7..2a8d87988d1f 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -2809,6 +2809,19 @@ static inline void
> > > cpufreq_update_util(struct rq *rq, unsigned int flags)
> > >         if (data)
> > >                 data->func(data, rq_clock(rq), flags);
> > >  }
> > > +
> > > +static inline void idle_update_util(void)
> > > +{
> > > +       struct update_util_data *data;
> > > +       struct rq *rq = cpu_rq(raw_smp_processor_id());
> > > +
> > > +       data =
> > > rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
> > > +                                                 cpu_of(rq)));
> > > +       if (data)
> > > +               data->func(data, rq_clock(rq), 0);
> > > +}
> > > +
> > > +
> > >  #else
> > >  static inline void cpufreq_update_util(struct rq *rq, unsigned
> > > int flags) {}
> > >  #endif /* CONFIG_CPU_FREQ */
> > > --


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-23  0:32                     ` srinivas pandruvada
@ 2022-02-23  0:40                       ` Feng Tang
  2022-02-23 14:23                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: Feng Tang @ 2022-02-23  0:40 UTC (permalink / raw)
  To: srinivas pandruvada
  Cc: Doug Smythies, Rafael J. Wysocki, Zhang, Rui, Thomas Gleixner,
	paulmck, stable, x86, linux-pm

On Tue, Feb 22, 2022 at 04:32:29PM -0800, srinivas pandruvada wrote:
> Hi Doug,
> 
> On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
> > Hi All,
> > 
> > I am about 1/2 way through testing Feng's "hacky debug patch",
> > let me know if I am wasting my time, and I'll abort. So far, it
> > works fine.
> This just proves that if you add some callback during long idle,  you
> will reach a less aggressive p-state. I think you already proved that
> with your results below showing 1W less average power ("Kernel 5.17-rc3
> + Feng patch (6 samples at 300 sec per").
> 
> Rafael replied with one possible option. Alternatively when planing to
> enter deep idle, set P-state to min with a callback like we do in
> offline callback.
 
Yes, if the system is going idle, it makes sense to go to a lower
cpufreq first (which is also what my debug patch will essentially lead to).

Given that the cpufreq update-util callback normally runs every 10 ms,
doing this before entering idle is not a big extra burden.

Thanks,
Feng

> So we need to think about a proper solution for this.
> 
> Thanks,
> Srinivas

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-22 18:04                 ` Rafael J. Wysocki
  2022-02-23  0:07                   ` Doug Smythies
@ 2022-02-23  2:49                   ` Feng Tang
  2022-02-23 14:11                     ` Rafael J. Wysocki
  2022-02-23  9:40                   ` Thomas Gleixner
  2 siblings, 1 reply; 47+ messages in thread
From: Feng Tang @ 2022-02-23  2:49 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: srinivas pandruvada, Doug Smythies, Zhang, Rui, Thomas Gleixner,
	paulmck, stable, x86, linux-pm

Hi Rafael,

On Tue, Feb 22, 2022 at 07:04:32PM +0100, Rafael J. Wysocki wrote:
> On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
> > > Hi Doug,
> > >
> > > I think you use CONFIG_NO_HZ_FULL.
> > > Here we are getting callback from scheduler. Can we check that if
> > > scheduler woke up on those CPUs?
> > > We can run "trace-cmd -e sched" and check in kernel shark if there is
> > > similar gaps in activity.
> >
> > Srinivas analyzed the scheduler trace data from trace-cmd, and thought is
> > related with the cpufreq callback is not called timeley from scheduling
> > events:
> >
> > "
> > I mean we ignore the callback when the target CPU is not a local CPU as
> > we have to do IPI to adjust MSRs.
> > This will happen many times when sched_wake will wake up a new CPU for
> > the thread (we will get a callack for the target) but once the remote
> > thread start executing "sched_switch", we will get a callback on local
> > CPU, so we will adjust frequencies (provided 10ms interval from the
> > last call).
> >
> > >From the trace file I see the scenario where it took 72sec between two
> > updates:
> > CPU 2
> > 34412.597161    busy=78         freq=3232653
> > 34484.450725    busy=63         freq=2606793
> >
> > There is periodic activity in between, related to active load balancing
> > in scheduler (since last frequency was higher these small work will
> > also run at higher frequency). But those threads are not CFS class, so
> > scheduler callback will not be called for them.
> >
> > So removing the patch removed a trigger which would have caused a
> > sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
> 
> And so this behavior needs to be restored for the time being which
> means reverting the problematic commit for 5.17 if possible.
> 
> It is unlikely that we'll get a proper fix before -rc7 and we still
> need to test it properly.

Thanks for checking this! 

I understand the time pressure as we approach the close of 5.17,
but I still don't think reverting commit b50db7095fe0 is the best
solution, because:

* b50db7095fe0 is not just an optimization; it solves real problems
  found on servers from big CSPs (Cloud Service Providers) and in
  data center server rooms.

* IMHO, b50db7095fe0 itself hasn't done anything wrong.

* The problem Doug found is a rarely occurring case, though it is
  expected behavior, as noted in the existing comment in
  cfs_rq_util_change():

	/*
	 * There are a few boundary cases this might miss but it should
	 * get called often enough that that should (hopefully) not be
	 * a real problem.
	 *
	 * It will not get called when we go idle, because the idle
	 * thread is a different class (!fair), nor will the utilization
	 * number include things like RT tasks.
	 *
	 * As is, the util number is not freq-invariant (we'd have to
	 * implement arch_scale_freq_capacity() for that).
	 *
	 * See cpu_util_cfs().
	 */
	cpufreq_update_util(rq, flags);

As this happens in the HWP-disabled case on a very quiet system, can we
find a proper solution in 5.17/5.18 or later, and cc stable?

> > But calling for every class, will be too many callbacks and not sure we
> > can even call for "stop" class, which these migration threads are
> > using.
> > "
> 
> Calling it for RT/deadline may not be a bad idea.
> 
> schedutil takes these classes into account when computing the
> utilization now (see effective_cpu_util()), so doing callbacks only
> for CFS seems insufficient.
> 
> Another way to avoid the issue at hand may be to prevent entering deep
> idle via PM QoS if the CPUs are running at high frequencies.
> 
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
> >   *
> >   * Called with polling cleared.
> >   */
> > +DEFINE_PER_CPU(u64, last_util_update_time);    /* in jiffies */
> >  static void do_idle(void)
> >  {
> >         int cpu = smp_processor_id();
> > +       u64 expire;
> >
> >         /*
> >          * Check if we need to update blocked load
> >          */
> >         nohz_run_idle_balance(cpu);
> >
> > +#ifdef CONFIG_X86_INTEL_PSTATE
> 
> Why?  Doesn't this affect the other ccpufreq governors?

You are right, this should be a generic thing.

I tried to limit its effect on other cases, though that's not necessary
for a debug patch.

Thanks,
Feng

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-22 18:04                 ` Rafael J. Wysocki
  2022-02-23  0:07                   ` Doug Smythies
  2022-02-23  2:49                   ` Feng Tang
@ 2022-02-23  9:40                   ` Thomas Gleixner
  2022-02-23 14:23                     ` Rafael J. Wysocki
  2 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2022-02-23  9:40 UTC (permalink / raw)
  To: Rafael J. Wysocki, Feng Tang
  Cc: srinivas pandruvada, Doug Smythies, Zhang, Rui, paulmck, stable,
	x86, linux-pm

Rafael,

On Tue, Feb 22 2022 at 19:04, Rafael J. Wysocki wrote:
> On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@intel.com> wrote:
>> There is periodic activity in between, related to active load balancing
>> in scheduler (since last frequency was higher these small work will
>> also run at higher frequency). But those threads are not CFS class, so
>> scheduler callback will not be called for them.
>>
>> So removing the patch removed a trigger which would have caused a
>> sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
>
> And so this behavior needs to be restored for the time being which
> means reverting the problematic commit for 5.17 if possible.

No. This is just papering over the problem. Just because the clocksource
watchdog has the side effect of making cpufreq "work" does not make it
a prerequisite for cpufreq. The commit unearthed a problem in the
cpufreq code, so it needs to be fixed there.

Even if we reverted it, you could still produce the same effect by adding
'tsc=reliable' to the kernel command line, which disables the clocksource
watchdog too. The commit is there to deal with modern hardware without
requiring people to add 'tsc=reliable' to the command line.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-23  2:49                   ` Feng Tang
@ 2022-02-23 14:11                     ` Rafael J. Wysocki
  0 siblings, 0 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2022-02-23 14:11 UTC (permalink / raw)
  To: Feng Tang
  Cc: Rafael J. Wysocki, srinivas pandruvada, Doug Smythies, Zhang,
	Rui, Thomas Gleixner, paulmck, stable, x86, linux-pm

On Wed, Feb 23, 2022 at 3:49 AM Feng Tang <feng.tang@intel.com> wrote:
>
> Hi Rafael,
>
> On Tue, Feb 22, 2022 at 07:04:32PM +0100, Rafael J. Wysocki wrote:
> > On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@intel.com> wrote:
> > >
> > > On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
> > > > Hi Doug,
> > > >
> > > > I think you use CONFIG_NO_HZ_FULL.
> > > > Here we are getting callback from scheduler. Can we check that if
> > > > scheduler woke up on those CPUs?
> > > > We can run "trace-cmd -e sched" and check in kernel shark if there is
> > > > similar gaps in activity.
> > >
> > > Srinivas analyzed the scheduler trace data from trace-cmd, and thought is
> > > related with the cpufreq callback is not called timeley from scheduling
> > > events:
> > >
> > > "
> > > I mean we ignore the callback when the target CPU is not a local CPU as
> > > we have to do IPI to adjust MSRs.
> > > This will happen many times when sched_wake will wake up a new CPU for
> > > the thread (we will get a callack for the target) but once the remote
> > > thread start executing "sched_switch", we will get a callback on local
> > > CPU, so we will adjust frequencies (provided 10ms interval from the
> > > last call).
> > >
> > > >From the trace file I see the scenario where it took 72sec between two
> > > updates:
> > > CPU 2
> > > 34412.597161    busy=78         freq=3232653
> > > 34484.450725    busy=63         freq=2606793
> > >
> > > There is periodic activity in between, related to active load balancing
> > > in scheduler (since last frequency was higher these small work will
> > > also run at higher frequency). But those threads are not CFS class, so
> > > scheduler callback will not be called for them.
> > >
> > > So removing the patch removed a trigger which would have caused a
> > > sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
> >
> > And so this behavior needs to be restored for the time being which
> > means reverting the problematic commit for 5.17 if possible.
> >
> > It is unlikely that we'll get a proper fix before -rc7 and we still
> > need to test it properly.
>
> Thanks for checking this!
>
> I understand the time limit as we are approaching the close of 5.17,
> but still I don't think reverting commit b50db7095fe0 is the best
> solution as:
>
> * b50db7095fe0 is not just an optimization, but solves some real
>   problems found in servers from big CSP (Cloud Service Provider)
>   and data center's server room.
>
> * IMHO, b50db7095fe0 itself hasn't done anything wrong.
>
> * This problem found by Doug is a rarely happened case, though
>   it is an expected thing as shown in existing comments of
>   cfs_rq_util_change():
>
>         /*
>          * There are a few boundary cases this might miss but it should
>          * get called often enough that that should (hopefully) not be
>          * a real problem.
>          *
>          * It will not get called when we go idle, because the idle
>          * thread is a different class (!fair), nor will the utilization
>          * number include things like RT tasks.
>          *
>          * As is, the util number is not freq-invariant (we'd have to
>          * implement arch_scale_freq_capacity() for that).
>          *
>          * See cpu_util_cfs().
>          */
>         cpufreq_update_util(rq, flags);
>
> As this happens with HWP-disabled case and a very calm system, can we
> find a proper solution in 5.17/5.18 or later, and cc stable?

We can.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-23  0:40                       ` Feng Tang
@ 2022-02-23 14:23                         ` Rafael J. Wysocki
  2022-02-24  8:08                           ` Feng Tang
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2022-02-23 14:23 UTC (permalink / raw)
  To: Feng Tang
  Cc: srinivas pandruvada, Doug Smythies, Rafael J. Wysocki, Zhang,
	Rui, Thomas Gleixner, paulmck, stable, x86, linux-pm

On Wed, Feb 23, 2022 at 1:40 AM Feng Tang <feng.tang@intel.com> wrote:
>
> On Tue, Feb 22, 2022 at 04:32:29PM -0800, srinivas pandruvada wrote:
> > Hi Doug,
> >
> > On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
> > > Hi All,
> > >
> > > I am about 1/2 way through testing Feng's "hacky debug patch",
> > > let me know if I am wasting my time, and I'll abort. So far, it
> > > works fine.
> > This just proves that if you add some callback during long idle,  you
> > will reach a less aggressive p-state. I think you already proved that
> > with your results below showing 1W less average power ("Kernel 5.17-rc3
> > + Feng patch (6 samples at 300 sec per").
> >
> > Rafael replied with one possible option. Alternatively when planing to
> > enter deep idle, set P-state to min with a callback like we do in
> > offline callback.
>
> Yes, if the system is going to idle, it makes sense to goto a lower
> cpufreq first (also what my debug patch will essentially lead to).
>
> Given cprfreq-util's normal running frequency is every 10ms, doing
> this before entering idle is not a big extra burden.

But this is not related to idle as such, but to the fact that idle
sometimes stops the scheduler tick which otherwise would run the
cpufreq governor callback on a regular basis.

It is stopping the tick that gets us into trouble, so I would avoid
doing it if the current performance state is too aggressive.

In principle, PM QoS can be used for that from intel_pstate, but there
is a problem with that approach, because it is not obvious what value
to give to it and it is not always guaranteed to work (say when all of
the C-states except for C1 are disabled).

So it looks like a new mechanism is needed for that.
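
Just to make the idea concrete, one possible shape for such a mechanism
(entirely hypothetical; nothing like cpufreq_cpu_wants_tick() exists
today) would be a per-CPU hint from the scaling driver that the idle
governor consults before stopping the tick:

/*
 * Hypothetical per-CPU hint a scaling driver could set while the
 * requested P-state is above some threshold (names are made up).
 */
DEFINE_PER_CPU(bool, cpufreq_wants_tick);

static inline bool cpufreq_cpu_wants_tick(int cpu)
{
	return per_cpu(cpufreq_wants_tick, cpu);
}

/*
 * Then, in the idle governor's ->select() (e.g. menu_select()), before it
 * decides to stop the tick:
 *
 *	if (cpufreq_cpu_wants_tick(dev->cpu))
 *		*stop_tick = false;
 *
 * so the governor callback keeps running and can ramp the frequency
 * back down.
 */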

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-23  9:40                   ` Thomas Gleixner
@ 2022-02-23 14:23                     ` Rafael J. Wysocki
  0 siblings, 0 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2022-02-23 14:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Rafael J. Wysocki, Feng Tang, srinivas pandruvada, Doug Smythies,
	Zhang, Rui, paulmck, stable, x86, linux-pm

On Wed, Feb 23, 2022 at 10:40 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Rafael,
>
> On Tue, Feb 22 2022 at 19:04, Rafael J. Wysocki wrote:
> > On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@intel.com> wrote:
> >> There is periodic activity in between, related to active load balancing
> >> in scheduler (since last frequency was higher these small work will
> >> also run at higher frequency). But those threads are not CFS class, so
> >> scheduler callback will not be called for them.
> >>
> >> So removing the patch removed a trigger which would have caused a
> >> sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
> >
> > And so this behavior needs to be restored for the time being which
> > means reverting the problematic commit for 5.17 if possible.
>
> No. This is just papering over the problem. Just because the clocksource
> watchdog has the side effect of making cpufreq "work", does not make it
> a prerequisite for cpufreq. The commit unearthed a problem in the
> cpufreq code, so it needs to be fixed there.
>
> Even if we'd revert it then, you can produce the same effect by adding
> 'tsc=reliable' to the kernel command line which disables the clocksource
> watchdog too. The commit is there to deal with modern hardware without
> requiring people to add 'tsc=reliable' to the command line.

All right (but I'll remember this exception).

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-23 14:23                         ` Rafael J. Wysocki
@ 2022-02-24  8:08                           ` Feng Tang
  2022-02-24 14:44                             ` Paul E. McKenney
  2022-02-25 17:45                             ` Rafael J. Wysocki
  0 siblings, 2 replies; 47+ messages in thread
From: Feng Tang @ 2022-02-24  8:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: srinivas pandruvada, Doug Smythies, Zhang, Rui, Thomas Gleixner,
	paulmck, stable, x86, linux-pm

On Wed, Feb 23, 2022 at 03:23:20PM +0100, Rafael J. Wysocki wrote:
> On Wed, Feb 23, 2022 at 1:40 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > On Tue, Feb 22, 2022 at 04:32:29PM -0800, srinivas pandruvada wrote:
> > > Hi Doug,
> > >
> > > On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
> > > > Hi All,
> > > >
> > > > I am about 1/2 way through testing Feng's "hacky debug patch",
> > > > let me know if I am wasting my time, and I'll abort. So far, it
> > > > works fine.
> > > This just proves that if you add some callback during long idle,  you
> > > will reach a less aggressive p-state. I think you already proved that
> > > with your results below showing 1W less average power ("Kernel 5.17-rc3
> > > + Feng patch (6 samples at 300 sec per").
> > >
> > > Rafael replied with one possible option. Alternatively when planing to
> > > enter deep idle, set P-state to min with a callback like we do in
> > > offline callback.
> >
> > Yes, if the system is going to idle, it makes sense to goto a lower
> > cpufreq first (also what my debug patch will essentially lead to).
> >
> > Given cprfreq-util's normal running frequency is every 10ms, doing
> > this before entering idle is not a big extra burden.
> 
> But this is not related to idle as such, but to the fact that idle
> sometimes stops the scheduler tick which otherwise would run the
> cpufreq governor callback on a regular basis.
> 
> It is stopping the tick that gets us into trouble, so I would avoid
> doing it if the current performance state is too aggressive.

I've tried to simulate Doug's environment by using his kconfig and
offlining CPUs on my 36-CPU desktop to leave 12 CPUs online. On it I can
still see local timer interrupts when there is no active load, with the
longest interval between two timer interrupts being 4 seconds; the idle
class's task_tick_idle() does nothing, while CFS's task_tick_fair() will
in turn call cfs_rq_util_change().

I searched the CFS/deadline/RT code; all three classes have places that
call cpufreq_update_util(), either in enqueue/dequeue or when changing
running bandwidth. So I think entering idle also means the system load
is changing significantly, and it is worth calling the cpufreq callback.

> In principle, PM QoS can be used for that from intel_pstate, but there
> is a problem with that approach, because it is not obvious what value
> to give to it and it is not always guaranteed to work (say when all of
> the C-states except for C1 are disabled).
> 
> So it looks like a new mechanism is needed for that.

If you think the idle class is not the right place to solve it, I can
also help test new patches.

Thanks,
Feng


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-24  8:08                           ` Feng Tang
@ 2022-02-24 14:44                             ` Paul E. McKenney
  2022-02-24 16:29                               ` Doug Smythies
  2022-02-25  0:29                               ` Feng Tang
  2022-02-25 17:45                             ` Rafael J. Wysocki
  1 sibling, 2 replies; 47+ messages in thread
From: Paul E. McKenney @ 2022-02-24 14:44 UTC (permalink / raw)
  To: Feng Tang
  Cc: Rafael J. Wysocki, srinivas pandruvada, Doug Smythies, Zhang,
	Rui, Thomas Gleixner, stable, x86, linux-pm

On Thu, Feb 24, 2022 at 04:08:30PM +0800, Feng Tang wrote:
> On Wed, Feb 23, 2022 at 03:23:20PM +0100, Rafael J. Wysocki wrote:
> > On Wed, Feb 23, 2022 at 1:40 AM Feng Tang <feng.tang@intel.com> wrote:
> > >
> > > On Tue, Feb 22, 2022 at 04:32:29PM -0800, srinivas pandruvada wrote:
> > > > Hi Doug,
> > > >
> > > > On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
> > > > > Hi All,
> > > > >
> > > > > I am about 1/2 way through testing Feng's "hacky debug patch",
> > > > > let me know if I am wasting my time, and I'll abort. So far, it
> > > > > works fine.
> > > > This just proves that if you add some callback during long idle,  you
> > > > will reach a less aggressive p-state. I think you already proved that
> > > > with your results below showing 1W less average power ("Kernel 5.17-rc3
> > > > + Feng patch (6 samples at 300 sec per").
> > > >
> > > > Rafael replied with one possible option. Alternatively when planing to
> > > > enter deep idle, set P-state to min with a callback like we do in
> > > > offline callback.
> > >
> > > Yes, if the system is going to idle, it makes sense to goto a lower
> > > cpufreq first (also what my debug patch will essentially lead to).
> > >
> > > Given cprfreq-util's normal running frequency is every 10ms, doing
> > > this before entering idle is not a big extra burden.
> > 
> > But this is not related to idle as such, but to the fact that idle
> > sometimes stops the scheduler tick which otherwise would run the
> > cpufreq governor callback on a regular basis.
> > 
> > It is stopping the tick that gets us into trouble, so I would avoid
> > doing it if the current performance state is too aggressive.
> 
> I've tried to simulate Doug's environment by using his kconfig, and
> offline my 36 CPUs Desktop to leave 12 CPUs online, and on it I can
> still see Local timer interrupts when there is no active load, with
> the longest interval between 2 timer interrupts is 4 seconds, while
> idle class's task_tick_idle() will do nothing, and CFS'
> task_tick_fair() will in turn call cfs_rq_util_change()

Every four seconds?  Could you please post your .config?

							Thanx, Paul

> I searched the cfs/deadline/rt code, these three classes  all have
> places to call cpufreq_update_util(), either in enqueue/dequeue or
> changing running bandwidth. So I think entering idle also means the
> system load is under a big change, and worth calling the cpufreq
> callback.
> 
> > In principle, PM QoS can be used for that from intel_pstate, but there
> > is a problem with that approach, because it is not obvious what value
> > to give to it and it is not always guaranteed to work (say when all of
> > the C-states except for C1 are disabled).
> > 
> > So it looks like a new mechanism is needed for that.
> 
> If you think idle class is not the right place to solve it, I can
> also help testing new patches.
> 
> Thanks,
> Feng
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-24 14:44                             ` Paul E. McKenney
@ 2022-02-24 16:29                               ` Doug Smythies
  2022-02-24 16:58                                 ` Paul E. McKenney
  2022-02-25  0:29                               ` Feng Tang
  1 sibling, 1 reply; 47+ messages in thread
From: Doug Smythies @ 2022-02-24 16:29 UTC (permalink / raw)
  To: paulmck
  Cc: 'Rafael J. Wysocki', 'srinivas pandruvada',
	'Zhang, Rui', 'Thomas Gleixner',
	stable, x86, linux-pm, Doug Smythies, 'Feng Tang'


On 2022.02.24 04:08:30 Paul E. McKenney wrote:
> On Thu, Feb 24, 2022 at 04:08:30PM +0800, Feng Tang wrote:
>> On Wed, Feb 23, 2022 at 03:23:20PM +0100, Rafael J. Wysocki wrote:
>>> On Wed, Feb 23, 2022 at 1:40 AM Feng Tang <feng.tang@intel.com> wrote:
>>> 
>>> But this is not related to idle as such, but to the fact that idle
>>> sometimes stops the scheduler tick which otherwise would run the
>>> cpufreq governor callback on a regular basis.
>>> 
>>> It is stopping the tick that gets us into trouble, so I would avoid
>>> doing it if the current performance state is too aggressive.
>> 
>> I've tried to simulate Doug's environment by using his kconfig, and
>> offline my 36 CPUs Desktop to leave 12 CPUs online, and on it I can
>> still see Local timer interrupts when there is no active load, with
>> the longest interval between 2 timer interrupts is 4 seconds, while
>> idle class's task_tick_idle() will do nothing, and CFS'
>> task_tick_fair() will in turn call cfs_rq_util_change()
>
> Every four seconds?  Could you please post your .config?
>
>							Thanx, Paul

I steal the kernel config from the Ubuntu mainline PPA.
See also earlier on this thread:

https://lore.kernel.org/linux-pm/CAAYoRsXkyWf0vmEE2HvjF6pzCC4utxTF=7AFx1PJv4Evh=C+Ow@mail.gmail.com/

but relevant part copied here:

> I steal the kernel configuration file from the Ubuntu mainline PPA
> [1], what they call "lowlatency", or 1000Hz tick. I make these
> changes before compile:
>
> scripts/config --disable DEBUG_INFO
> scripts/config --disable SYSTEM_TRUSTED_KEYS
> scripts/config --disable SYSTEM_REVOCATION_KEYS
>
> I [will] also send you the config and dmesg files in an off-list email.
>
> [1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/

I put the same one I sent Feng on my web site where I was
sharing stuff with Srinivas (coded to avoid the barrage of bots):

double u double u double u dot smythies dot com/~doug/linux/s18/hwp/srinivas/long_dur/
 
... Doug



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-24 16:29                               ` Doug Smythies
@ 2022-02-24 16:58                                 ` Paul E. McKenney
  0 siblings, 0 replies; 47+ messages in thread
From: Paul E. McKenney @ 2022-02-24 16:58 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael J. Wysocki', 'srinivas pandruvada',
	'Zhang, Rui', 'Thomas Gleixner',
	stable, x86, linux-pm, 'Feng Tang'

On Thu, Feb 24, 2022 at 08:29:22AM -0800, Doug Smythies wrote:
> 
> On 2022.02.24 04:08:30 Paul E. McKenney wrote:
> > On Thu, Feb 24, 2022 at 04:08:30PM +0800, Feng Tang wrote:
> >> On Wed, Feb 23, 2022 at 03:23:20PM +0100, Rafael J. Wysocki wrote:
> >>> On Wed, Feb 23, 2022 at 1:40 AM Feng Tang <feng.tang@intel.com> wrote:
> >>> 
> >>> But this is not related to idle as such, but to the fact that idle
> >>> sometimes stops the scheduler tick which otherwise would run the
> >>> cpufreq governor callback on a regular basis.
> >>> 
> >>> It is stopping the tick that gets us into trouble, so I would avoid
> >>> doing it if the current performance state is too aggressive.
> >> 
> >> I've tried to simulate Doug's environment by using his kconfig, and
> >> offline my 36 CPUs Desktop to leave 12 CPUs online, and on it I can
> >> still see Local timer interrupts when there is no active load, with
> >> the longest interval between 2 timer interrupts is 4 seconds, while
> >> idle class's task_tick_idle() will do nothing, and CFS'
> >> task_tick_fair() will in turn call cfs_rq_util_change()
> >
> > Every four seconds?  Could you please post your .config?
> >
> >							Thanx, Paul
> 
> I steal the kernel config from the Ubuntu mainline PPA.
> See also earlier on this thread:
> 
> https://lore.kernel.org/linux-pm/CAAYoRsXkyWf0vmEE2HvjF6pzCC4utxTF=7AFx1PJv4Evh=C+Ow@mail.gmail.com/
> 
> but relevant part copied here:
> 
> > I steal the kernel configuration file from the Ubuntu mainline PPA
> > [1], what they call "lowlatency", or 1000Hz tick. I make these
> > changes before compile:
> >
> > scripts/config --disable DEBUG_INFO
> > scripts/config --disable SYSTEM_TRUSTED_KEYS
> > scripts/config --disable SYSTEM_REVOCATION_KEYS
> >
> > I [will] also send you the config and dmesg files in an off-list email.
> >
> > [1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/
> 
> I put the same one I sent Feng on my web site where I was
> sharing stuff with Srinivas (coded to avoid the barrage of bots):
> 
> double u double u double u dot smythies dot com/~doug/linux/s18/hwp/srinivas/long_dur/

Thank you!

I don't see CONFIG_FAST_NO_HZ=y in your .config, so that is not the
reason for your every-four-second timers.  ;-)

(CONFIG_FAST_NO_HZ is being removed, FYI.)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-24 14:44                             ` Paul E. McKenney
  2022-02-24 16:29                               ` Doug Smythies
@ 2022-02-25  0:29                               ` Feng Tang
  2022-02-25  1:06                                 ` Paul E. McKenney
  1 sibling, 1 reply; 47+ messages in thread
From: Feng Tang @ 2022-02-25  0:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Rafael J. Wysocki, srinivas pandruvada, Doug Smythies, Zhang,
	Rui, Thomas Gleixner, stable, x86, linux-pm

Hi Paul,

On Thu, Feb 24, 2022 at 06:44:23AM -0800, Paul E. McKenney wrote:
[...]
> > > > > Rafael replied with one possible option. Alternatively when planing to
> > > > > enter deep idle, set P-state to min with a callback like we do in
> > > > > offline callback.
> > > >
> > > > Yes, if the system is going to idle, it makes sense to goto a lower
> > > > cpufreq first (also what my debug patch will essentially lead to).
> > > >
> > > > Given cprfreq-util's normal running frequency is every 10ms, doing
> > > > this before entering idle is not a big extra burden.
> > > 
> > > But this is not related to idle as such, but to the fact that idle
> > > sometimes stops the scheduler tick which otherwise would run the
> > > cpufreq governor callback on a regular basis.
> > > 
> > > It is stopping the tick that gets us into trouble, so I would avoid
> > > doing it if the current performance state is too aggressive.
> > 
> > I've tried to simulate Doug's environment by using his kconfig, and
> > offline my 36 CPUs Desktop to leave 12 CPUs online, and on it I can
> > still see Local timer interrupts when there is no active load, with
> > the longest interval between 2 timer interrupts is 4 seconds, while
> > idle class's task_tick_idle() will do nothing, and CFS'
> > task_tick_fair() will in turn call cfs_rq_util_change()
> 
> Every four seconds?  Could you please post your .config?
 
Aha, I didn't make it clear: the timer interrupt was not always
coming every 4 seconds; rather, when the system is quiet, the maximum
interval between two timer interrupts was 4 seconds.

When I initially checked this, I suspected there were too few timer
interrupts on the system, so I used Doug's config and tried to make my
desktop quiet (e.g. by disabling the GUI). Below is some trace_printk
log, though I figured out later that when the idle thread is running,
the idle class's scheduler tick does not help, as it doesn't call the
cpufreq callback.

          <idle>-0       [009] d.h1.   235.980053: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.981054: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.982053: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.983053: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.984053: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.985053: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.986054: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.987054: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.988054: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.989054: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.990054: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.991053: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.992054: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.993054: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   235.994054: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   236.331126: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   236.460130: hrtimer_interrupt: enter
          <idle>-0       [009] d.s5.   236.460147: intel_pstate_update_util: old_state=48 new=27
          <idle>-0       [009] d.h1.   238.380130: hrtimer_interrupt: enter
          <idle>-0       [009] d.s5.   238.380147: intel_pstate_update_util: old_state=27 new=12
          <idle>-0       [009] d.h1.   240.331133: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   240.364133: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   244.331135: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   248.331139: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   252.331138: hrtimer_interrupt: enter
           <...>-1167    [009] d.h..   254.860056: hrtimer_interrupt: enter
           snapd-1128    [009] d.h..   254.861054: hrtimer_interrupt: enter
           snapd-1128    [009] d.h..   254.862055: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   254.863056: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   254.864056: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   254.865055: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   256.331133: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   260.331127: hrtimer_interrupt: enter
          <idle>-0       [009] d.h1.   264.331135: hrtimer_interrupt: enter
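
For reference, the trace_printk hooks behind this log were roughly at
these two spots (shown as hypothetical hunks only, to make clear where
the two kinds of messages come from; the exact debug change is not
included here):

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ (hypothetical placement) @@ void hrtimer_interrupt(struct clock_event_device *dev)
 {
+	trace_printk("hrtimer_interrupt: enter\n");

--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ (hypothetical placement) @@ static void intel_pstate_update_pstate(struct cpudata *cpu, int pstate)
 	if (pstate == cpu->pstate.current_pstate)
 		return;
+	trace_printk("intel_pstate_update_util: old_state=%d new=%d\n",
+		     cpu->pstate.current_pstate, pstate);
 	cpu->pstate.current_pstate = pstate;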

Thanks,
Feng

> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-25  0:29                               ` Feng Tang
@ 2022-02-25  1:06                                 ` Paul E. McKenney
  0 siblings, 0 replies; 47+ messages in thread
From: Paul E. McKenney @ 2022-02-25  1:06 UTC (permalink / raw)
  To: Feng Tang
  Cc: Rafael J. Wysocki, srinivas pandruvada, Doug Smythies, Zhang,
	Rui, Thomas Gleixner, stable, x86, linux-pm

On Fri, Feb 25, 2022 at 08:29:51AM +0800, Feng Tang wrote:
> Hi Paul,
> 
> On Thu, Feb 24, 2022 at 06:44:23AM -0800, Paul E. McKenney wrote:
> [...]
> > > > > > Rafael replied with one possible option. Alternatively, when planning to
> > > > > > enter deep idle, set P-state to min with a callback like we do in
> > > > > > offline callback.
> > > > >
> > > > > Yes, if the system is going to idle, it makes sense to go to a lower
> > > > > cpufreq first (also what my debug patch will essentially lead to).
> > > > >
> > > > > Given that the cpufreq util callback normally runs every 10ms,
> > > > > doing this before entering idle is not a big extra burden.
> > > > 
> > > > But this is not related to idle as such, but to the fact that idle
> > > > sometimes stops the scheduler tick which otherwise would run the
> > > > cpufreq governor callback on a regular basis.
> > > > 
> > > > It is stopping the tick that gets us into trouble, so I would avoid
> > > > doing it if the current performance state is too aggressive.
> > > 
> > > I've tried to simulate Doug's environment by using his kconfig and
> > > offlining CPUs on my 36-CPU desktop to leave 12 CPUs online. On it I
> > > can still see Local timer interrupts when there is no active load,
> > > with the longest interval between 2 timer interrupts being 4 seconds,
> > > while the idle class's task_tick_idle() will do nothing, and CFS'
> > > task_tick_fair() will in turn call cfs_rq_util_change().
> > 
> > Every four seconds?  Could you please post your .config?
>  
> Aha, I didn't make it clear: the timer interrupt was not always coming
> every 4 seconds, but when the system is silent, the maximum interval
> between 2 timer interrupts was 4 seconds.
> 
> When I initially checked this, I doubted whether the timer interrupts
> were too few on the system, so I used Doug's config and tried to make
> my desktop silent (e.g. by disabling the GUI). The following is some
> trace_printk log, though I figured out later that when the idle thread
> is running, the idle class's scheduler tick will not help, as it
> doesn't call the cpufreq callback.
> 
>           <idle>-0       [009] d.h1.   235.980053: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.981054: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.982053: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.983053: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.984053: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.985053: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.986054: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.987054: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.988054: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.989054: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.990054: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.991053: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.992054: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.993054: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   235.994054: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   236.331126: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   236.460130: hrtimer_interrupt: enter
>           <idle>-0       [009] d.s5.   236.460147: intel_pstate_update_util: old_state=48 new=27
>           <idle>-0       [009] d.h1.   238.380130: hrtimer_interrupt: enter
>           <idle>-0       [009] d.s5.   238.380147: intel_pstate_update_util: old_state=27 new=12
>           <idle>-0       [009] d.h1.   240.331133: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   240.364133: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   244.331135: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   248.331139: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   252.331138: hrtimer_interrupt: enter
>            <...>-1167    [009] d.h..   254.860056: hrtimer_interrupt: enter
>            snapd-1128    [009] d.h..   254.861054: hrtimer_interrupt: enter
>            snapd-1128    [009] d.h..   254.862055: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   254.863056: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   254.864056: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   254.865055: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   256.331133: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   260.331127: hrtimer_interrupt: enter
>           <idle>-0       [009] d.h1.   264.331135: hrtimer_interrupt: enter

Thank you for the clarification!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-24  8:08                           ` Feng Tang
  2022-02-24 14:44                             ` Paul E. McKenney
@ 2022-02-25 17:45                             ` Rafael J. Wysocki
  2022-02-26  0:36                               ` Doug Smythies
  1 sibling, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2022-02-25 17:45 UTC (permalink / raw)
  To: Feng Tang, Doug Smythies
  Cc: Rafael J. Wysocki, srinivas pandruvada, Zhang, Rui,
	Thomas Gleixner, paulmck, stable, x86, linux-pm

On Thursday, February 24, 2022 9:08:30 AM CET Feng Tang wrote:
> On Wed, Feb 23, 2022 at 03:23:20PM +0100, Rafael J. Wysocki wrote:
> > On Wed, Feb 23, 2022 at 1:40 AM Feng Tang <feng.tang@intel.com> wrote:
> > >
> > > On Tue, Feb 22, 2022 at 04:32:29PM -0800, srinivas pandruvada wrote:
> > > > Hi Doug,
> > > >
> > > > On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
> > > > > Hi All,
> > > > >
> > > > > I am about 1/2 way through testing Feng's "hacky debug patch",
> > > > > let me know if I am wasting my time, and I'll abort. So far, it
> > > > > works fine.
> > > > This just proves that if you add some callback during long idle,  you
> > > > will reach a less aggressive p-state. I think you already proved that
> > > > with your results below showing 1W less average power ("Kernel 5.17-rc3
> > > > + Feng patch (6 samples at 300 sec per").
> > > >
> > > > Rafael replied with one possible option. Alternatively, when planning to
> > > > enter deep idle, set P-state to min with a callback like we do in
> > > > offline callback.
> > >
> > > Yes, if the system is going to idle, it makes sense to go to a lower
> > > cpufreq first (also what my debug patch will essentially lead to).
> > >
> > > Given that the cpufreq util callback normally runs every 10ms,
> > > doing this before entering idle is not a big extra burden.
> > 
> > But this is not related to idle as such, but to the fact that idle
> > sometimes stops the scheduler tick which otherwise would run the
> > cpufreq governor callback on a regular basis.
> > 
> > It is stopping the tick that gets us into trouble, so I would avoid
> > doing it if the current performance state is too aggressive.
> 
> I've tried to simulate Doug's environment by using his kconfig and
> offlining CPUs on my 36-CPU desktop to leave 12 CPUs online. On it I
> can still see Local timer interrupts when there is no active load,
> with the longest interval between 2 timer interrupts being 4 seconds,
> while the idle class's task_tick_idle() will do nothing, and CFS'
> task_tick_fair() will in turn call cfs_rq_util_change().
> 
> I searched the cfs/deadline/rt code; these three classes all have
> places that call cpufreq_update_util(), either in enqueue/dequeue or
> when changing running bandwidth. So I think entering idle also means
> the system load is undergoing a big change, and is worth calling the
> cpufreq callback for.
> 
> > In principle, PM QoS can be used for that from intel_pstate, but there
> > is a problem with that approach, because it is not obvious what value
> > to give to it and it is not always guaranteed to work (say when all of
> > the C-states except for C1 are disabled).
> > 
> > So it looks like a new mechanism is needed for that.
> 
> If you think idle class is not the right place to solve it, I can
> also help testing new patches.

So I have the appended experimental patch to address this issue that's not
been tested at all.  Caveat emptor.

I've been looking for something relatively low-overhead and taking all of the
dependencies into account.

---
 drivers/cpufreq/intel_pstate.c     |   40 ++++++++++++++++++++++++++++---------
 drivers/cpuidle/governors/ladder.c |    6 +++--
 drivers/cpuidle/governors/menu.c   |    2 +
 drivers/cpuidle/governors/teo.c    |    3 ++
 include/linux/cpuidle.h            |    4 +++
 5 files changed, 44 insertions(+), 11 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -284,6 +284,8 @@ static int menu_select(struct cpuidle_dr
 	if (unlikely(delta < 0)) {
 		delta = 0;
 		delta_tick = 0;
+	} else if (dev->retain_tick) {
+		delta = delta_tick;
 	}
 	data->next_timer_ns = delta;
 
Index: linux-pm/drivers/cpuidle/governors/teo.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/teo.c
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -308,6 +308,9 @@ static int teo_select(struct cpuidle_dri
 	cpu_data->time_span_ns = local_clock();
 
 	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
+	if (dev->retain_tick)
+		duration_ns = delta_tick;
+
 	cpu_data->sleep_length_ns = duration_ns;
 
 	/* Check if there is any choice in the first place. */
Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -93,6 +93,7 @@ struct cpuidle_device {
 	unsigned int		registered:1;
 	unsigned int		enabled:1;
 	unsigned int		poll_time_limit:1;
+	unsigned int		retain_tick:1;
 	unsigned int		cpu;
 	ktime_t			next_hrtimer;
 
@@ -172,6 +173,8 @@ extern int cpuidle_play_dead(void);
 extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev);
 static inline struct cpuidle_device *cpuidle_get_device(void)
 {return __this_cpu_read(cpuidle_devices); }
+
+extern void cpuidle_update_retain_tick(bool val);
 #else
 static inline void disable_cpuidle(void) { }
 static inline bool cpuidle_not_available(struct cpuidle_driver *drv,
@@ -211,6 +214,7 @@ static inline int cpuidle_play_dead(void
 static inline struct cpuidle_driver *cpuidle_get_cpu_driver(
 	struct cpuidle_device *dev) {return NULL; }
 static inline struct cpuidle_device *cpuidle_get_device(void) {return NULL; }
+static inline void cpuidle_update_retain_tick(bool val) { }
 #endif
 
 #ifdef CONFIG_CPU_IDLE
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -19,6 +19,7 @@
 #include <linux/list.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
+#include <linux/cpuidle.h>
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/fs.h>
@@ -1970,6 +1971,30 @@ static inline void intel_pstate_cppc_set
 }
 #endif /* CONFIG_ACPI_CPPC_LIB */
 
+static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
+{
+	int pstate = cpu->pstate.current_pstate;
+
+	/*
+	 * Avoid stopping the scheduler tick from cpuidle on CPUs in turbo
+	 * P-states to prevent them from getting back to the high frequency
+	 * right away after getting out of deep idle.
+	 */
+	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
+	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
+}
+
+static void intel_pstate_update_perf_ctl_wrapper(void *cpu_data)
+{
+	intel_pstate_update_perf_ctl(cpu_data);
+}
+
+static void intel_pstate_update_perf_ctl_on_cpu(struct cpudata *cpu)
+{
+	smp_call_function_single(cpu->cpu, intel_pstate_update_perf_ctl_wrapper,
+				 cpu, 1);
+}
+
 static void intel_pstate_set_pstate(struct cpudata *cpu, int pstate)
 {
 	trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu->cpu);
@@ -1979,8 +2004,7 @@ static void intel_pstate_set_pstate(stru
 	 * the CPU being updated, so force the register update to run on the
 	 * right CPU.
 	 */
-	wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
-		      pstate_funcs.get_val(cpu, pstate));
+	intel_pstate_update_perf_ctl_on_cpu(cpu);
 }
 
 static void intel_pstate_set_min_pstate(struct cpudata *cpu)
@@ -2256,7 +2280,7 @@ static void intel_pstate_update_pstate(s
 		return;
 
 	cpu->pstate.current_pstate = pstate;
-	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
+	intel_pstate_update_perf_ctl(cpu);
 }
 
 static void intel_pstate_adjust_pstate(struct cpudata *cpu)
@@ -2843,11 +2867,9 @@ static void intel_cpufreq_perf_ctl_updat
 					  u32 target_pstate, bool fast_switch)
 {
 	if (fast_switch)
-		wrmsrl(MSR_IA32_PERF_CTL,
-		       pstate_funcs.get_val(cpu, target_pstate));
+		intel_pstate_update_perf_ctl(cpu);
 	else
-		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
-			      pstate_funcs.get_val(cpu, target_pstate));
+		intel_pstate_update_perf_ctl_on_cpu(cpu);
 }
 
 static int intel_cpufreq_update_pstate(struct cpufreq_policy *policy,
@@ -2857,6 +2879,8 @@ static int intel_cpufreq_update_pstate(s
 	int old_pstate = cpu->pstate.current_pstate;
 
 	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
+	cpu->pstate.current_pstate = target_pstate;
+
 	if (hwp_active) {
 		int max_pstate = policy->strict_target ?
 					target_pstate : cpu->max_perf_ratio;
@@ -2867,8 +2891,6 @@ static int intel_cpufreq_update_pstate(s
 		intel_cpufreq_perf_ctl_update(cpu, target_pstate, fast_switch);
 	}
 
-	cpu->pstate.current_pstate = target_pstate;
-
 	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
 			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
 
Index: linux-pm/drivers/cpuidle/governors/ladder.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/ladder.c
+++ linux-pm/drivers/cpuidle/governors/ladder.c
@@ -61,10 +61,10 @@ static inline void ladder_do_selection(s
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
- * @dummy: not used
+ * @stop_tick: Whether or not to stop the scheduler tick
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-			       struct cpuidle_device *dev, bool *dummy)
+			       struct cpuidle_device *dev, bool *stop_tick)
 {
 	struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
 	struct ladder_device_state *last_state;
@@ -73,6 +73,8 @@ static int ladder_select_state(struct cp
 	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
 	s64 last_residency;
 
+	*stop_tick = !dev->retain_tick;
+
 	/* Special case when user has set very strict latency requirement */
 	if (unlikely(latency_req == 0)) {
 		ladder_do_selection(dev, ldev, last_idx, 0);




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-25 17:45                             ` Rafael J. Wysocki
@ 2022-02-26  0:36                               ` Doug Smythies
  2022-02-28  4:12                                 ` Feng Tang
  0 siblings, 1 reply; 47+ messages in thread
From: Doug Smythies @ 2022-02-26  0:36 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Feng Tang, Rafael J. Wysocki, srinivas pandruvada, Zhang, Rui,
	Thomas Gleixner, paulmck, stable, x86, linux-pm, dsmythies

On Fri, Feb 25, 2022 at 9:46 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> On Thursday, February 24, 2022 9:08:30 AM CET Feng Tang wrote:
...
> > > So it looks like a new mechanism is needed for that.
> >
> > If you think idle class is not the right place to solve it, I can
> > also help testing new patches.
>
> So I have the appended experimental patch to address this issue that's not
> been tested at all.  Caveat emptor.

Hi Rafael,

O.K., you gave fair warning.

The patch applied fine.
It does not compile for me.
The function cpuidle_update_retain_tick does not exist.
Shouldn't it be somewhere in cpuidle.c?
I used the function cpuidle_disable_device as a template
for searching and comparing.

Because all of my baseline results are with kernel 5.17-rc3,
that is what I am still using.

Error:
ld: drivers/cpufreq/intel_pstate.o: in function `intel_pstate_update_perf_ctl':
intel_pstate.c:(.text+0x2520): undefined reference to
`cpuidle_update_retain_tick'
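
Presumably the missing piece is something along these lines (just my
guess, modelled on the other helpers that use cpuidle_get_device(); it
is not anything from your patch):

/*
 * Guess at the missing helper: record the flag in the local CPU's
 * cpuidle device, which is what the governors then read as
 * dev->retain_tick.
 */
void cpuidle_update_retain_tick(bool val)
{
	struct cpuidle_device *dev = cpuidle_get_device();

	if (dev)
		dev->retain_tick = val;
}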

... Doug

>
> I've been looking for something relatively low-overhead and taking all of the
> dependencies into account.
>
> ---
>  drivers/cpufreq/intel_pstate.c     |   40 ++++++++++++++++++++++++++++---------
>  drivers/cpuidle/governors/ladder.c |    6 +++--
>  drivers/cpuidle/governors/menu.c   |    2 +
>  drivers/cpuidle/governors/teo.c    |    3 ++
>  include/linux/cpuidle.h            |    4 +++
>  5 files changed, 44 insertions(+), 11 deletions(-)
>
> Index: linux-pm/drivers/cpuidle/governors/menu.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/governors/menu.c
> +++ linux-pm/drivers/cpuidle/governors/menu.c
> @@ -284,6 +284,8 @@ static int menu_select(struct cpuidle_dr
>         if (unlikely(delta < 0)) {
>                 delta = 0;
>                 delta_tick = 0;
> +       } else if (dev->retain_tick) {
> +               delta = delta_tick;
>         }
>         data->next_timer_ns = delta;
>
> Index: linux-pm/drivers/cpuidle/governors/teo.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/governors/teo.c
> +++ linux-pm/drivers/cpuidle/governors/teo.c
> @@ -308,6 +308,9 @@ static int teo_select(struct cpuidle_dri
>         cpu_data->time_span_ns = local_clock();
>
>         duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> +       if (dev->retain_tick)
> +               duration_ns = delta_tick;
> +
>         cpu_data->sleep_length_ns = duration_ns;
>
>         /* Check if there is any choice in the first place. */
> Index: linux-pm/include/linux/cpuidle.h
> ===================================================================
> --- linux-pm.orig/include/linux/cpuidle.h
> +++ linux-pm/include/linux/cpuidle.h
> @@ -93,6 +93,7 @@ struct cpuidle_device {
>         unsigned int            registered:1;
>         unsigned int            enabled:1;
>         unsigned int            poll_time_limit:1;
> +       unsigned int            retain_tick:1;
>         unsigned int            cpu;
>         ktime_t                 next_hrtimer;
>
> @@ -172,6 +173,8 @@ extern int cpuidle_play_dead(void);
>  extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev);
>  static inline struct cpuidle_device *cpuidle_get_device(void)
>  {return __this_cpu_read(cpuidle_devices); }
> +
> +extern void cpuidle_update_retain_tick(bool val);
>  #else
>  static inline void disable_cpuidle(void) { }
>  static inline bool cpuidle_not_available(struct cpuidle_driver *drv,
> @@ -211,6 +214,7 @@ static inline int cpuidle_play_dead(void
>  static inline struct cpuidle_driver *cpuidle_get_cpu_driver(
>         struct cpuidle_device *dev) {return NULL; }
>  static inline struct cpuidle_device *cpuidle_get_device(void) {return NULL; }
> +static inline void cpuidle_update_retain_tick(bool val) { }
>  #endif
>
>  #ifdef CONFIG_CPU_IDLE
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -19,6 +19,7 @@
>  #include <linux/list.h>
>  #include <linux/cpu.h>
>  #include <linux/cpufreq.h>
> +#include <linux/cpuidle.h>
>  #include <linux/sysfs.h>
>  #include <linux/types.h>
>  #include <linux/fs.h>
> @@ -1970,6 +1971,30 @@ static inline void intel_pstate_cppc_set
>  }
>  #endif /* CONFIG_ACPI_CPPC_LIB */
>
> +static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
> +{
> +       int pstate = cpu->pstate.current_pstate;
> +
> +       /*
> +        * Avoid stopping the scheduler tick from cpuidle on CPUs in turbo
> +        * P-states to prevent them from getting back to the high frequency
> +        * right away after getting out of deep idle.
> +        */
> +       cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
> +       wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
> +}
> +
> +static void intel_pstate_update_perf_ctl_wrapper(void *cpu_data)
> +{
> +       intel_pstate_update_perf_ctl(cpu_data);
> +}
> +
> +static void intel_pstate_update_perf_ctl_on_cpu(struct cpudata *cpu)
> +{
> +       smp_call_function_single(cpu->cpu, intel_pstate_update_perf_ctl_wrapper,
> +                                cpu, 1);
> +}
> +
>  static void intel_pstate_set_pstate(struct cpudata *cpu, int pstate)
>  {
>         trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu->cpu);
> @@ -1979,8 +2004,7 @@ static void intel_pstate_set_pstate(stru
>          * the CPU being updated, so force the register update to run on the
>          * right CPU.
>          */
> -       wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
> -                     pstate_funcs.get_val(cpu, pstate));
> +       intel_pstate_update_perf_ctl_on_cpu(cpu);
>  }
>
>  static void intel_pstate_set_min_pstate(struct cpudata *cpu)
> @@ -2256,7 +2280,7 @@ static void intel_pstate_update_pstate(s
>                 return;
>
>         cpu->pstate.current_pstate = pstate;
> -       wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
> +       intel_pstate_update_perf_ctl(cpu);
>  }
>
>  static void intel_pstate_adjust_pstate(struct cpudata *cpu)
> @@ -2843,11 +2867,9 @@ static void intel_cpufreq_perf_ctl_updat
>                                           u32 target_pstate, bool fast_switch)
>  {
>         if (fast_switch)
> -               wrmsrl(MSR_IA32_PERF_CTL,
> -                      pstate_funcs.get_val(cpu, target_pstate));
> +               intel_pstate_update_perf_ctl(cpu);
>         else
> -               wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
> -                             pstate_funcs.get_val(cpu, target_pstate));
> +               intel_pstate_update_perf_ctl_on_cpu(cpu);
>  }
>
>  static int intel_cpufreq_update_pstate(struct cpufreq_policy *policy,
> @@ -2857,6 +2879,8 @@ static int intel_cpufreq_update_pstate(s
>         int old_pstate = cpu->pstate.current_pstate;
>
>         target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> +       cpu->pstate.current_pstate = target_pstate;
> +
>         if (hwp_active) {
>                 int max_pstate = policy->strict_target ?
>                                         target_pstate : cpu->max_perf_ratio;
> @@ -2867,8 +2891,6 @@ static int intel_cpufreq_update_pstate(s
>                 intel_cpufreq_perf_ctl_update(cpu, target_pstate, fast_switch);
>         }
>
> -       cpu->pstate.current_pstate = target_pstate;
> -
>         intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
>                             INTEL_PSTATE_TRACE_TARGET, old_pstate);
>
> Index: linux-pm/drivers/cpuidle/governors/ladder.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/governors/ladder.c
> +++ linux-pm/drivers/cpuidle/governors/ladder.c
> @@ -61,10 +61,10 @@ static inline void ladder_do_selection(s
>   * ladder_select_state - selects the next state to enter
>   * @drv: cpuidle driver
>   * @dev: the CPU
> - * @dummy: not used
> + * @stop_tick: Whether or not to stop the scheduler tick
>   */
>  static int ladder_select_state(struct cpuidle_driver *drv,
> -                              struct cpuidle_device *dev, bool *dummy)
> +                              struct cpuidle_device *dev, bool *stop_tick)
>  {
>         struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
>         struct ladder_device_state *last_state;
> @@ -73,6 +73,8 @@ static int ladder_select_state(struct cp
>         s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
>         s64 last_residency;
>
> +       *stop_tick = !dev->retain_tick;
> +
>         /* Special case when user has set very strict latency requirement */
>         if (unlikely(latency_req == 0)) {
>                 ladder_do_selection(dev, ldev, last_idx, 0);
>
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-26  0:36                               ` Doug Smythies
@ 2022-02-28  4:12                                 ` Feng Tang
  2022-02-28 19:36                                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: Feng Tang @ 2022-02-28  4:12 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, srinivas pandruvada, Zhang,
	Rui, Thomas Gleixner, paulmck, stable, x86, linux-pm

On Fri, Feb 25, 2022 at 04:36:53PM -0800, Doug Smythies wrote:
> On Fri, Feb 25, 2022 at 9:46 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > On Thursday, February 24, 2022 9:08:30 AM CET Feng Tang wrote:
> ...
> > > > So it looks like a new mechanism is needed for that.
> > >
> > > If you think idle class is not the right place to solve it, I can
> > > also help testing new patches.
> >
> > So I have the appended experimental patch to address this issue that's not
> > been tested at all.  Caveat emptor.
> 
> Hi Rafael,
> 
> O.K., you gave fair warning.
> 
> The patch applied fine.
> It does not compile for me.
> The function cpuidle_update_retain_tick does not exist.
> Shouldn't it be somewhere in cpuidle.c?
> I used the function cpuidle_disable_device as a template
> for searching and comparing.
> 
> Because all of my baseline results are with kernel 5.17-rc3,
> that is what I am still using.
> 
> Error:
> ld: drivers/cpufreq/intel_pstate.o: in function `intel_pstate_update_perf_ctl':
> intel_pstate.c:(.text+0x2520): undefined reference to
> `cpuidle_update_retain_tick'
 
Same here, seems the cpuidle_update_retain_tick()'s implementation
is missing.

Thanks,
Feng

> ... Doug
> 
> >
> > I've been looking for something relatively low-overhead and taking all of the
> > dependencies into account.
> >
> > ---
> >  drivers/cpufreq/intel_pstate.c     |   40 ++++++++++++++++++++++++++++---------
> >  drivers/cpuidle/governors/ladder.c |    6 +++--
> >  drivers/cpuidle/governors/menu.c   |    2 +
> >  drivers/cpuidle/governors/teo.c    |    3 ++
> >  include/linux/cpuidle.h            |    4 +++
> >  5 files changed, 44 insertions(+), 11 deletions(-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-28  4:12                                 ` Feng Tang
@ 2022-02-28 19:36                                   ` Rafael J. Wysocki
  2022-03-01  5:52                                     ` Feng Tang
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2022-02-28 19:36 UTC (permalink / raw)
  To: Doug Smythies, Feng Tang
  Cc: Rafael J. Wysocki, srinivas pandruvada, Zhang, Rui,
	Thomas Gleixner, paulmck, stable, x86, linux-pm

On Monday, February 28, 2022 5:12:28 AM CET Feng Tang wrote:
> On Fri, Feb 25, 2022 at 04:36:53PM -0800, Doug Smythies wrote:
> > On Fri, Feb 25, 2022 at 9:46 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >
> > > On Thursday, February 24, 2022 9:08:30 AM CET Feng Tang wrote:
> > ...
> > > > > So it looks like a new mechanism is needed for that.
> > > >
> > > > If you think idle class is not the right place to solve it, I can
> > > > also help testing new patches.
> > >
> > > So I have the appended experimental patch to address this issue that's not
> > > been tested at all.  Caveat emptor.
> > 
> > Hi Rafael,
> > 
> > O.K., you gave fair warning.
> > 
> > The patch applied fine.
> > It does not compile for me.
> > The function cpuidle_update_retain_tick does not exist.
> > Shouldn't it be somewhere in cpuidle.c?
> > I used the function cpuidle_disable_device as a template
> > for searching and comparing.
> > 
> > Because all of my baseline results are with kernel 5.17-rc3,
> > that is what I am still using.
> > 
> > Error:
> > ld: drivers/cpufreq/intel_pstate.o: in function `intel_pstate_update_perf_ctl':
> > intel_pstate.c:(.text+0x2520): undefined reference to
> > `cpuidle_update_retain_tick'
>  
> Same here, seems the cpuidle_update_retain_tick()'s implementation
> is missing.

That's a patch generation issue on my part, sorry.

However, it was a bit racy, so maybe it's good that it was not complete.

Below is a new version.

---
 drivers/cpufreq/intel_pstate.c     |   40 ++++++++++++++++++++++++++++---------
 drivers/cpuidle/governor.c         |   23 +++++++++++++++++++++
 drivers/cpuidle/governors/ladder.c |    6 +++--
 drivers/cpuidle/governors/menu.c   |    2 +
 drivers/cpuidle/governors/teo.c    |    3 ++
 include/linux/cpuidle.h            |    4 +++
 6 files changed, 67 insertions(+), 11 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -284,6 +284,8 @@ static int menu_select(struct cpuidle_dr
 	if (unlikely(delta < 0)) {
 		delta = 0;
 		delta_tick = 0;
+	} else if (cpuidle_retain_local_tick()) {
+		delta = delta_tick;
 	}
 	data->next_timer_ns = delta;
 
Index: linux-pm/drivers/cpuidle/governors/teo.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/teo.c
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -308,6 +308,9 @@ static int teo_select(struct cpuidle_dri
 	cpu_data->time_span_ns = local_clock();
 
 	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
+	if (cpuidle_retain_local_tick())
+		duration_ns = delta_tick;
+
 	cpu_data->sleep_length_ns = duration_ns;
 
 	/* Check if there is any choice in the first place. */
Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -172,6 +172,9 @@ extern int cpuidle_play_dead(void);
 extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev);
 static inline struct cpuidle_device *cpuidle_get_device(void)
 {return __this_cpu_read(cpuidle_devices); }
+
+extern void cpuidle_update_retain_tick(bool val);
+extern bool cpuidle_retain_local_tick(void);
 #else
 static inline void disable_cpuidle(void) { }
 static inline bool cpuidle_not_available(struct cpuidle_driver *drv,
@@ -211,6 +214,7 @@ static inline int cpuidle_play_dead(void
 static inline struct cpuidle_driver *cpuidle_get_cpu_driver(
 	struct cpuidle_device *dev) {return NULL; }
 static inline struct cpuidle_device *cpuidle_get_device(void) {return NULL; }
+static inline void cpuidle_update_retain_tick(bool val) { }
 #endif
 
 #ifdef CONFIG_CPU_IDLE
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -19,6 +19,7 @@
 #include <linux/list.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
+#include <linux/cpuidle.h>
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/fs.h>
@@ -1970,6 +1971,30 @@ static inline void intel_pstate_cppc_set
 }
 #endif /* CONFIG_ACPI_CPPC_LIB */
 
+static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
+{
+	int pstate = cpu->pstate.current_pstate;
+
+	/*
+	 * Avoid stopping the scheduler tick from cpuidle on CPUs in turbo
+	 * P-states to prevent them from getting back to the high frequency
+	 * right away after getting out of deep idle.
+	 */
+	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
+	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
+}
+
+static void intel_pstate_update_perf_ctl_wrapper(void *cpu_data)
+{
+	intel_pstate_update_perf_ctl(cpu_data);
+}
+
+static void intel_pstate_update_perf_ctl_on_cpu(struct cpudata *cpu)
+{
+	smp_call_function_single(cpu->cpu, intel_pstate_update_perf_ctl_wrapper,
+				 cpu, 1);
+}
+
 static void intel_pstate_set_pstate(struct cpudata *cpu, int pstate)
 {
 	trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu->cpu);
@@ -1979,8 +2004,7 @@ static void intel_pstate_set_pstate(stru
 	 * the CPU being updated, so force the register update to run on the
 	 * right CPU.
 	 */
-	wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
-		      pstate_funcs.get_val(cpu, pstate));
+	intel_pstate_update_perf_ctl_on_cpu(cpu);
 }
 
 static void intel_pstate_set_min_pstate(struct cpudata *cpu)
@@ -2256,7 +2280,7 @@ static void intel_pstate_update_pstate(s
 		return;
 
 	cpu->pstate.current_pstate = pstate;
-	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
+	intel_pstate_update_perf_ctl(cpu);
 }
 
 static void intel_pstate_adjust_pstate(struct cpudata *cpu)
@@ -2843,11 +2867,9 @@ static void intel_cpufreq_perf_ctl_updat
 					  u32 target_pstate, bool fast_switch)
 {
 	if (fast_switch)
-		wrmsrl(MSR_IA32_PERF_CTL,
-		       pstate_funcs.get_val(cpu, target_pstate));
+		intel_pstate_update_perf_ctl(cpu);
 	else
-		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
-			      pstate_funcs.get_val(cpu, target_pstate));
+		intel_pstate_update_perf_ctl_on_cpu(cpu);
 }
 
 static int intel_cpufreq_update_pstate(struct cpufreq_policy *policy,
@@ -2857,6 +2879,8 @@ static int intel_cpufreq_update_pstate(s
 	int old_pstate = cpu->pstate.current_pstate;
 
 	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
+	cpu->pstate.current_pstate = target_pstate;
+
 	if (hwp_active) {
 		int max_pstate = policy->strict_target ?
 					target_pstate : cpu->max_perf_ratio;
@@ -2867,8 +2891,6 @@ static int intel_cpufreq_update_pstate(s
 		intel_cpufreq_perf_ctl_update(cpu, target_pstate, fast_switch);
 	}
 
-	cpu->pstate.current_pstate = target_pstate;
-
 	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
 			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
 
Index: linux-pm/drivers/cpuidle/governors/ladder.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/ladder.c
+++ linux-pm/drivers/cpuidle/governors/ladder.c
@@ -61,10 +61,10 @@ static inline void ladder_do_selection(s
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
- * @dummy: not used
+ * @stop_tick: Whether or not to stop the scheduler tick
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-			       struct cpuidle_device *dev, bool *dummy)
+			       struct cpuidle_device *dev, bool *stop_tick)
 {
 	struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
 	struct ladder_device_state *last_state;
@@ -73,6 +73,8 @@ static int ladder_select_state(struct cp
 	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
 	s64 last_residency;
 
+	*stop_tick = !cpuidle_retain_local_tick();
+
 	/* Special case when user has set very strict latency requirement */
 	if (unlikely(latency_req == 0)) {
 		ladder_do_selection(dev, ldev, last_idx, 0);
Index: linux-pm/drivers/cpuidle/governor.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governor.c
+++ linux-pm/drivers/cpuidle/governor.c
@@ -118,3 +118,26 @@ s64 cpuidle_governor_latency_req(unsigne
 
 	return (s64)device_req * NSEC_PER_USEC;
 }
+
+static DEFINE_PER_CPU(bool, cpuidle_retain_tick);
+
+/**
+ * cpuidle_update_retain_tick - Update the local CPU's retain_tick flag.
+ * @val: New value of the flag.
+ *
+ * The retain_tick flag controls whether or not to cpuidle is allowed to stop
+ * the scheduler tick on the local CPU and it can be updated with the help of
+ * this function.
+ */
+void cpuidle_update_retain_tick(bool val)
+{
+	this_cpu_write(cpuidle_retain_tick, val);
+}
+
+/**
+ * cpuidle_retain_local_tick - Return the local CPU's retain_tick flag value.
+ */
+bool cpuidle_retain_local_tick(void)
+{
+	return this_cpu_read(cpuidle_retain_tick);
+}





^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-02-28 19:36                                   ` Rafael J. Wysocki
@ 2022-03-01  5:52                                     ` Feng Tang
  2022-03-01 11:58                                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: Feng Tang @ 2022-03-01  5:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, Rafael J. Wysocki, srinivas pandruvada, Zhang,
	Rui, Thomas Gleixner, paulmck, stable, x86, linux-pm

On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
> On Monday, February 28, 2022 5:12:28 AM CET Feng Tang wrote:
> > On Fri, Feb 25, 2022 at 04:36:53PM -0800, Doug Smythies wrote:
> > > On Fri, Feb 25, 2022 at 9:46 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > >
> > > > On Thursday, February 24, 2022 9:08:30 AM CET Feng Tang wrote:
> > > ...
> > > > > > So it looks like a new mechanism is needed for that.
> > > > >
> > > > > If you think idle class is not the right place to solve it, I can
> > > > > also help testing new patches.
> > > >
> > > > So I have the appended experimental patch to address this issue that's not
> > > > been tested at all.  Caveat emptor.
> > > 
> > > Hi Rafael,
> > > 
> > > O.K., you gave fair warning.
> > > 
> > > The patch applied fine.
> > > It does not compile for me.
> > > The function cpuidle_update_retain_tick does not exist.
> > > Shouldn't it be somewhere in cpuidle.c?
> > > I used the function cpuidle_disable_device as a template
> > > for searching and comparing.
> > > 
> > > Because all of my baseline results are with kernel 5.17-rc3,
> > > that is what I am still using.
> > > 
> > > Error:
> > > ld: drivers/cpufreq/intel_pstate.o: in function `intel_pstate_update_perf_ctl':
> > > intel_pstate.c:(.text+0x2520): undefined reference to
> > > `cpuidle_update_retain_tick'
> >  
> > Same here, seems the cpuidle_update_retain_tick()'s implementation
> > is missing.
> 
> That's a patch generation issue on my part, sorry.
> 
> However, it was a bit racy, so maybe it's good that it was not complete.
> 
> Below is a new version.
 
Thanks for the new version. I just gave it a try, and the occasional
long delay of cpufreq auto-adjusting that I have seen cannot be
reproduced after applying it.

As my test is a rough one which can't reproduce what Doug has seen
(including the power meter data), it's better to wait for his test result. 

And one minor question for the code.

> ---
>  drivers/cpufreq/intel_pstate.c     |   40 ++++++++++++++++++++++++++++---------
>  drivers/cpuidle/governor.c         |   23 +++++++++++++++++++++
>  drivers/cpuidle/governors/ladder.c |    6 +++--
>  drivers/cpuidle/governors/menu.c   |    2 +
>  drivers/cpuidle/governors/teo.c    |    3 ++
>  include/linux/cpuidle.h            |    4 +++
>  6 files changed, 67 insertions(+), 11 deletions(-)
>
[SNIP]

> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -19,6 +19,7 @@
>  #include <linux/list.h>
>  #include <linux/cpu.h>
>  #include <linux/cpufreq.h>
> +#include <linux/cpuidle.h>
>  #include <linux/sysfs.h>
>  #include <linux/types.h>
>  #include <linux/fs.h>
> @@ -1970,6 +1971,30 @@ static inline void intel_pstate_cppc_set
>  }
>  #endif /* CONFIG_ACPI_CPPC_LIB */
>  
> +static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
> +{
> +	int pstate = cpu->pstate.current_pstate;
> +
> +	/*
> +	 * Avoid stopping the scheduler tick from cpuidle on CPUs in turbo
> +	 * P-states to prevent them from getting back to the high frequency
> +	 * right away after getting out of deep idle.
> +	 */
> +	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);

In our test, the workload will make the CPU go to the highest frequency
before going to idle, but should we also consider the case of a high
but not highest cpufreq?

Thanks,
Feng

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-01  5:52                                     ` Feng Tang
@ 2022-03-01 11:58                                       ` Rafael J. Wysocki
  2022-03-01 17:18                                         ` Doug Smythies
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2022-03-01 11:58 UTC (permalink / raw)
  To: Feng Tang
  Cc: Rafael J. Wysocki, Doug Smythies, Rafael J. Wysocki,
	srinivas pandruvada, Zhang, Rui, Thomas Gleixner, paulmck,
	stable, x86, linux-pm

On Tue, Mar 1, 2022 at 6:53 AM Feng Tang <feng.tang@intel.com> wrote:
>
> On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
> > On Monday, February 28, 2022 5:12:28 AM CET Feng Tang wrote:
> > > On Fri, Feb 25, 2022 at 04:36:53PM -0800, Doug Smythies wrote:
> > > > On Fri, Feb 25, 2022 at 9:46 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > > >
> > > > > On Thursday, February 24, 2022 9:08:30 AM CET Feng Tang wrote:
> > > > ...
> > > > > > > So it looks like a new mechanism is needed for that.
> > > > > >
> > > > > > If you think idle class is not the right place to solve it, I can
> > > > > > also help testing new patches.
> > > > >
> > > > > So I have the appended experimental patch to address this issue that's not
> > > > > been tested at all.  Caveat emptor.
> > > >
> > > > Hi Rafael,
> > > >
> > > > O.K., you gave fair warning.
> > > >
> > > > The patch applied fine.
> > > > It does not compile for me.
> > > > The function cpuidle_update_retain_tick does not exist.
> > > > Shouldn't it be somewhere in cpuidle.c?
> > > > I used the function cpuidle_disable_device as a template
> > > > for searching and comparing.
> > > >
> > > > Because all of my baseline results are with kernel 5.17-rc3,
> > > > that is what I am still using.
> > > >
> > > > Error:
> > > > ld: drivers/cpufreq/intel_pstate.o: in function `intel_pstate_update_perf_ctl':
> > > > intel_pstate.c:(.text+0x2520): undefined reference to
> > > > `cpuidle_update_retain_tick'
> > >
> > > Same here, seems the cpuidle_update_retain_tick()'s implementation
> > > is missing.
> >
> > That's a patch generation issue on my part, sorry.
> >
> > However, it was a bit racy, so maybe it's good that it was not complete.
> >
> > Below is a new version.
>
> Thanks for the new version. I just gave it a try, and the occasional
> long delay of cpufreq auto-adjusting that I have seen cannot be
> reproduced after applying it.

OK, thanks!

I'll wait for feedback from Doug, though.

> As my test is a rough one which can't reproduce what Doug has seen
> (including the power meter data), it's better to wait for his test result.
>
> And one minor question for the code.
>
> > ---
> >  drivers/cpufreq/intel_pstate.c     |   40 ++++++++++++++++++++++++++++---------
> >  drivers/cpuidle/governor.c         |   23 +++++++++++++++++++++
> >  drivers/cpuidle/governors/ladder.c |    6 +++--
> >  drivers/cpuidle/governors/menu.c   |    2 +
> >  drivers/cpuidle/governors/teo.c    |    3 ++
> >  include/linux/cpuidle.h            |    4 +++
> >  6 files changed, 67 insertions(+), 11 deletions(-)
> >
> [SNIP]
>
> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/list.h>
> >  #include <linux/cpu.h>
> >  #include <linux/cpufreq.h>
> > +#include <linux/cpuidle.h>
> >  #include <linux/sysfs.h>
> >  #include <linux/types.h>
> >  #include <linux/fs.h>
> > @@ -1970,6 +1971,30 @@ static inline void intel_pstate_cppc_set
> >  }
> >  #endif /* CONFIG_ACPI_CPPC_LIB */
> >
> > +static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
> > +{
> > +     int pstate = cpu->pstate.current_pstate;
> > +
> > +     /*
> > +      * Avoid stopping the scheduler tick from cpuidle on CPUs in turbo
> > +      * P-states to prevent them from getting back to the high frequency
> > +      * right away after getting out of deep idle.
> > +      */
> > +     cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
>
> In our test, the workload will make the CPU go to the highest frequency
> before going to idle, but should we also consider the case of a high
> but not highest cpufreq?

This covers the entire turbo range (max_pstate is the highest non-turbo one).
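
In other words (an illustration only, not code from the patch; these
are the existing struct pstate_data fields, where
min_pstate <= max_pstate < turbo_pstate):

/* True exactly for turbo P-states: max_pstate is the highest non-turbo one. */
static bool pstate_in_turbo_range(struct pstate_data *ps, int pstate)
{
	return pstate > ps->max_pstate;
}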

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-01 11:58                                       ` Rafael J. Wysocki
@ 2022-03-01 17:18                                         ` Doug Smythies
  2022-03-01 17:34                                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: Doug Smythies @ 2022-03-01 17:18 UTC (permalink / raw)
  To: Rafael J. Wysocki, Feng Tang
  Cc: Rafael J. Wysocki, srinivas pandruvada, Zhang, Rui,
	Thomas Gleixner, paulmck, stable, x86, linux-pm, dsmythies

On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Tue, Mar 1, 2022 at 6:53 AM Feng Tang <feng.tang@intel.com> wrote:
> > On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
...
> > >
> > > However, it was a bit racy, so maybe it's good that it was not complete.
> > >
> > > Below is a new version.
> >
> > Thanks for the new version. I just gave it a try,  and the occasional
> > long delay of cpufreq auto-adjusting I have seen can not be reproduced
> > after applying it.
>
> OK, thanks!
>
> I'll wait for feedback from Doug, though.

Hi Rafael,

Thank you for your version 2 patch.
I screwed up an overnight test and will have to re-do it.
However, I do have some results.

From reading the patch code, one worry was the
potential to drive down the desired/required CPU
frequency for the main periodic workflow, causing
overruns, or inability of the task to complete its
work before the next period. I have always had overrun
information, but it has never been relevant before.

The other worry was if the threshold of
turbo/not turbo frequency is enough.

I do not know how to test any final solution
thoroughly, as so far I have simply found a
good enough problematic example.
We have so many years of experience with the convenient multi-second
NMI forcing clean-up of a lingering high pstate. I'd keep it, and
decide within it whether the TSC stuff needs to be executed or not.

Anyway...

Base Kernel 5.17-rc3.
"stock" : unmodified.
"revert" : with commit b50db7095fe reverted
"rjw-2" : with this version2 patch added.

Test 1 (as before. There is no test 2, yet.):
347 Hertz work/sleep frequency on one CPU while others do
virtually no load but enough to increase the requested pstate,
but at around 0.02 hertz aggregate.

It is important to note that the main load is approximately
38.6% @ 2.422 GHz, or 100% at 0.935 GHz, and almost exclusively
uses idle state 2 (of 4 total idle states):

/sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
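
For anyone wanting to reproduce something similar, the 347 Hz
work/sleep load above is roughly of this form (a simplified sketch, not
my actual test program; 0.386 * 2.422 GHz ~= 0.935 GHz is where the two
duty-cycle numbers above come from):

/* Fixed-work periodic load: do a constant amount of work each 1/347 s
 * period, sleep to the next absolute period boundary, and count an
 * overrun whenever the work does not finish within its period.
 * Run pinned to one CPU, e.g. under taskset.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define PERIOD_NS (1000000000L / 347)	/* ~2.88 ms -> 347 Hz */
#define WORK_LOOPS 400000L		/* tune for ~38% of the period */

int main(void)
{
	struct timespec next, now;
	volatile uint64_t x = 0;
	long overruns = 0;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (long loop = 0; loop < 347L * 3600; loop++) {	/* ~1 hour */
		for (long i = 0; i < WORK_LOOPS; i++)		/* fixed work */
			x += i;

		next.tv_nsec += PERIOD_NS;
		if (next.tv_nsec >= 1000000000L) {
			next.tv_nsec -= 1000000000L;
			next.tv_sec++;
		}
		clock_gettime(CLOCK_MONOTONIC, &now);
		if (now.tv_sec > next.tv_sec ||
		    (now.tv_sec == next.tv_sec && now.tv_nsec > next.tv_nsec))
			overruns++;	/* missed the period boundary */

		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
	}
	printf("overruns: %ld\n", overruns);
	return 0;
}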

Turbostat was used. ~10 samples at 300 seconds per.
Processor package power (Watts):

Workflow was run for 1 hour each time or 1249201 loops.

revert:
ave: 3.00
min: 2.89
max: 3.08
ave freq: 2.422 GHz.
overruns: 1.
max overrun time: 113 uSec.

stock:
ave: 3.63 (+21%)
min: 3.28
max: 3.99
ave freq: 2.791 GHz.
overruns: 2.
max overrun time: 677 uSec.

rjw-2:
ave: 3.14 (+5%)
min: 2.97
max: 3.28
ave freq: 2.635 GHz
overruns: 1042.
max overrun time: 9,769 uSec.

... Doug

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-01 17:18                                         ` Doug Smythies
@ 2022-03-01 17:34                                           ` Rafael J. Wysocki
  2022-03-02  4:06                                             ` Doug Smythies
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2022-03-01 17:34 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Feng Tang, Rafael J. Wysocki,
	srinivas pandruvada, Zhang, Rui, Thomas Gleixner, paulmck,
	stable, x86, linux-pm

On Tue, Mar 1, 2022 at 6:18 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Tue, Mar 1, 2022 at 6:53 AM Feng Tang <feng.tang@intel.com> wrote:
> > > On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
> ...
> > > >
> > > > However, it was a bit racy, so maybe it's good that it was not complete.
> > > >
> > > > Below is a new version.
> > >
> > > Thanks for the new version. I just gave it a try,  and the occasional
> > > long delay of cpufreq auto-adjusting I have seen can not be reproduced
> > > after applying it.
> >
> > OK, thanks!
> >
> > I'll wait for feedback from Doug, though.
>
> Hi Rafael,
>
> Thank you for your version 2 patch.
> I screwed up an overnight test and will have to re-do it.
> However, I do have some results.

Thanks for testing it!

> From reading the patch code, one worry was the
> potential to drive down the desired/required CPU
> frequency for the main periodic workflow, causing
> overruns, or inability of the task to complete its
> work before the next period.

It is not clear to me why you worried about that just from reading the
patch?  Can you explain, please?

> I have always had overrun
> information, but it has never been relevant before.
>
> The other worry was if the threshold of
> turbo/not turbo frequency is enough.

Agreed.

> I do not know how to test any final solution
> thoroughly, as so far I have simply found a
> good enough problematic example.
> We have so many years of experience with the convenient multi-second
> NMI forcing clean-up of a lingering high pstate. I'd keep it, and
> decide within it whether the TSC stuff needs to be executed or not.
>
> Anyway...
>
> Base Kernel 5.17-rc3.
> "stock" : unmodified.
> "revert" : with commit b50db7095fe reverted
> "rjw-2" : with this version2 patch added.
>
> Test 1 (as before. There is no test 2, yet.):
> 347 Hertz work/sleep frequency on one CPU while others do
> virtually no load but enough to increase the requested pstate,
> but at around 0.02 hertz aggregate.
>
> It is important to note the main load is approximately
> 38.6% @ 2.422 GHz, or 100% at 0.935 GHz.
> and almost exclusively uses idle state 2 (of
> 4 total idle states)
>
> /sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
> /sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
> /sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
> /sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
>
> Turbostat was used. ~10 samples at 300 seconds per.
> Processor package power (Watts):
>
> Workflow was run for 1 hour each time or 1249201 loops.
>
> revert:
> ave: 3.00
> min: 2.89
> max: 3.08

I'm not sure what the above three numbers are.

> ave freq: 2.422 GHz.
> overruns: 1.
> max overrun time: 113 uSec.
>
> stock:
> ave: 3.63 (+21%)
> min: 3.28
> max: 3.99
> ave freq: 2.791 GHz.
> overruns: 2.
> max overrun time: 677 uSec.
>
> rjw-2:
> ave: 3.14 (+5%)
> min: 2.97
> max: 3.28
> ave freq: 2.635 GHz

I guess the numbers above could be reduced still by using a P-state
below the max non-turbo one as a limit.

> overruns: 1042.
> max overrun time: 9,769 uSec.

This would probably get worse then, though.

ATM I'm not quite sure why this happens, but you seem to have some
insight into it, so it would help if you shared it.

Thanks!

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-01 17:34                                           ` Rafael J. Wysocki
@ 2022-03-02  4:06                                             ` Doug Smythies
  2022-03-02 19:00                                               ` Rafael J. Wysocki
  2022-03-03  5:27                                               ` Feng Tang
  0 siblings, 2 replies; 47+ messages in thread
From: Doug Smythies @ 2022-03-02  4:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Feng Tang, Rafael J. Wysocki, srinivas pandruvada, Zhang, Rui,
	Thomas Gleixner, paulmck, stable, x86, linux-pm, dsmythies

On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Tue, Mar 1, 2022 at 6:18 PM Doug Smythies <dsmythies@telus.net> wrote:
> > On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > On Tue, Mar 1, 2022 at 6:53 AM Feng Tang <feng.tang@intel.com> wrote:
> > > > On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
> > ...
> > > > >
> > > > > However, it was a bit racy, so maybe it's good that it was not complete.
> > > > >
> > > > > Below is a new version.
> > > >
> > > > Thanks for the new version. I just gave it a try,  and the occasional
> > > > long delay of cpufreq auto-adjusting I have seen can not be reproduced
> > > > after applying it.
> > >
> > > OK, thanks!
> > >
> > > I'll wait for feedback from Dough, though.
> >
> > Hi Rafael,
> >
> > Thank you for your version 2 patch.
> > I screwed up an overnight test and will have to re-do it.
> > However, I do have some results.
>
> Thanks for testing it!
>
> > From reading the patch code, one worry was the
> > potential to drive down the desired/required CPU
> > frequency for the main periodic workflow, causing
> > overruns, or inability of the task to complete its
> > work before the next period.
>
> It is not clear to me why you worried about that just from reading the
> patch?  Can you explain, please?

It is already covered below. And a couple of further tests
directly contradict my thinking.

> > I have always had overrun
> > information, but it has never been relevant before.
> >
> > The other worry was if the threshold of
> > turbo/not turbo frequency is enough.
>
> Agreed.

Just as an easy example and test I did this on
top of this patch ("rjw-3"):

doug@s19:~/kernel/linux$ git diff
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f878a4545eee..5cbdd7e479e8 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
         * P-states to prevent them from getting back to the high frequency
         * right away after getting out of deep idle.
         */
-       cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
+       cpuidle_update_retain_tick(pstate > (cpu->pstate.max_pstate >> 1));
        wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
 }
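
To make the shift concrete with made-up numbers (say max_pstate = 40,
i.e. a 4.0 GHz non-turbo maximum):

   rjw-2: retain tick when pstate > max_pstate      = 40  (turbo range only)
   rjw-3: retain tick when pstate > max_pstate >> 1 = 20  (~2.0 GHz and up)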

> > I do not know how to test any final solution
> > thoroughly, as so far I have simply found a
> > good enough problematic example.
> > We have so many years of experience with the convenient multi-second
> > NMI forcing clean-up of a lingering high pstate. I'd keep it, and
> > decide within it whether the TSC stuff needs to be executed or not.
> >
> > Anyway...
> >
> > Base Kernel 5.17-rc3.
> > "stock" : unmodified.
> > "revert" : with commit b50db7095fe reverted
> > "rjw-2" : with this version2 patch added.
> >
> > Test 1 (as before. There is no test 2, yet.):
> > 347 Hertz work/sleep frequency on one CPU while others do
> > virtually no load but enough to increase the requested pstate,
> > but at around 0.02 hertz aggregate.
> >
> > It is important to note the main load is approximately
> > 38.6% @ 2.422 GHz, or 100% at 0.935 GHz.
> > and almost exclusively uses idle state 2 (of
> > 4 total idle states)
> >
> > /sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
> > /sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
> > /sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
> > /sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
> >
> > Turbostat was used. ~10 samples at 300 seconds per.
> > Processor package power (Watts):
> >
> > Workflow was run for 1 hour each time or 1249201 loops.
> >
> > revert:
> > ave: 3.00
> > min: 2.89
> > max: 3.08
>
> I'm not sure what the above three numbers are.

Processor package power, Watts.
>
> > ave freq: 2.422 GHz.
> > overruns: 1.
> > max overrun time: 113 uSec.
> >
> > stock:
> > ave: 3.63 (+21%)
> > min: 3.28
> > max: 3.99
> > ave freq: 2.791 GHz.
> > overruns: 2.
> > max overrun time: 677 uSec.
> >
> > rjw-2:
> > ave: 3.14 (+5%)
> > min: 2.97
> > max: 3.28
> > ave freq: 2.635 GHz
>
> I guess the numbers above could be reduced still by using a P-state
> below the max non-turbo one as a limit.

Yes, and for a test I did "rjw-3".

> > overruns: 1042.
> > max overrun time: 9,769 uSec.
>
> This would probably get worse then, though.

Yes, that was my expectation, but not what happened.

rjw-3:
ave: 3.09 watts
min: 3.01 watts
max: 31.7 watts
ave freq: 2.42 GHz.
overruns: 12. (I did not expect this.)
Max overruns time: 621 uSec.

Note 1: IRQ's increased by 74%. i.e. it was going in
and out of idle a lot more.

Note 2: We know that processor package power
is highly temperature dependent. I forgot to let my
coolant cool adequately after the kernel compile,
and so had to throw out the first 4 power samples
(20 minutes).

I retested both rjw-2 and rjw-3, but shorter tests
and got 0 overruns in both cases.

> ATM I'm not quite sure why this happens, but you seem to have some
> insight into it, so it would help if you shared it.

My insight seems questionable.

My thinking was that one can not decide if the pstate needs to go
down or not based on such a localized look. The risk being that the
higher periodic load might suffer overruns. Since my first test did exactly
that, I violated my own "repeat all tests 3 times before reporting rule".
Now, I am not sure what is going on.
I will need more time to acquire traces and dig into it.

I also did a 1 hour intel_pstate_tracer test, with rjw-2, on an idle system
and saw several long durations. This was expected as this patch set
wouldn't change durations by more than a few jiffies.
755 long durations (>6.1 seconds), and 327.7 seconds longest.

... Doug

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-02  4:06                                             ` Doug Smythies
@ 2022-03-02 19:00                                               ` Rafael J. Wysocki
  2022-03-03 23:00                                                 ` Doug Smythies
  2022-03-03  5:27                                               ` Feng Tang
  1 sibling, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2022-03-02 19:00 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Feng Tang, Rafael J. Wysocki,
	srinivas pandruvada, Zhang, Rui, Thomas Gleixner, paulmck,
	stable, x86, linux-pm

On Wed, Mar 2, 2022 at 5:06 AM Doug Smythies <dsmythies@telus.net> wrote:
>
> On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Tue, Mar 1, 2022 at 6:18 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > On Tue, Mar 1, 2022 at 6:53 AM Feng Tang <feng.tang@intel.com> wrote:
> > > > > On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
> > > ...
> > > > > >
> > > > > > However, it was a bit racy, so maybe it's good that it was not complete.
> > > > > >
> > > > > > Below is a new version.
> > > > >
> > > > > Thanks for the new version. I just gave it a try,  and the occasional
> > > > > long delay of cpufreq auto-adjusting I have seen can not be reproduced
> > > > > after applying it.
> > > >
> > > > OK, thanks!
> > > >
> > > > I'll wait for feedback from Doug, though.
> > >
> > > Hi Rafael,
> > >
> > > Thank you for your version 2 patch.
> > > I screwed up an overnight test and will have to re-do it.
> > > However, I do have some results.
> >
> > Thanks for testing it!
> >
> > > From reading the patch code, one worry was the
> > > potential to drive down the desired/required CPU
> > > frequency for the main periodic workflow, causing
> > > overruns, or inability of the task to complete its
> > > work before the next period.
> >
> > It is not clear to me why you worried about that just from reading the
> > patch?  Can you explain, please?
>
> It is already covered below. And a couple of further tests
> directly contradict my thinking.
>
> > > I have always had overrun
> > > information, but it has never been relevant before.
> > >
> > > The other worry was if the threshold of
> > > turbo/not turbo frequency is enough.
> >
> > Agreed.
>
> Just as an easy example and test I did this on
> top of this patch ("rjw-3"):
>
> doug@s19:~/kernel/linux$ git diff
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index f878a4545eee..5cbdd7e479e8 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct
> cpudata *cpu)
>          * P-states to prevent them from getting back to the high frequency
>          * right away after getting out of deep idle.
>          */
> -       cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
> +       cpuidle_update_retain_tick(pstate > (cpu->pstate.max_pstate >> 1));

OK, but cpu->pstate.max_pstate / 2 may be almost
cpu->pstate.min_pstate which would be better to use here IMO.

Or something like (cpu->pstate.max_pstate + cpu->pstate.min_pstate) /
2.  Can you try this one in particular?

>         wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
>  }
>
> > > I do not know how to test any final solution
> > > thoroughly, as so far I have simply found a
> > > good enough problematic example.
> > > We have so many years of experience with
> > > the convenient multi second NMI forcing
> > > lingering high pstate clean up. I'd keep it
> > > deciding within it if the TSC stuff needs to be
> > > executed or not.
> > >
> > > Anyway...
> > >
> > > Base Kernel 5.17-rc3.
> > > "stock" : unmodified.
> > > "revert" : with commit b50db7095fe reverted
> > > "rjw-2" : with this version2 patch added.
> > >
> > > Test 1 (as before. There is no test 2, yet.):
> > > 347 Hertz work/sleep frequency on one CPU while others do
> > > virtually no load but enough to increase the requested pstate,
> > > but at around 0.02 hertz aggregate.
> > >
> > > It is important to note the main load is approximately
> > > 38.6% @ 2.422 GHz, or 100% at 0.935 GHz.
> > > and almost exclusively uses idle state 2 (of
> > > 4 total idle states)
> > >
> > > /sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
> > > /sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
> > > /sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
> > > /sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
> > >
> > > Turbostat was used. ~10 samples at 300 seconds per.
> > > Processor package power (Watts):
> > >
> > > Workflow was run for 1 hour each time or 1249201 loops.
> > >
> > > revert:
> > > ave: 3.00
> > > min: 2.89
> > > max: 3.08
> >
> > I'm not sure what the above three numbers are.
>
> Processor package power, Watts.

OK

> > > ave freq: 2.422 GHz.
> > > overruns: 1.
> > > max overrun time: 113 uSec.
> > >
> > > stock:
> > > ave: 3.63 (+21%)
> > > min: 3.28
> > > max: 3.99
> > > ave freq: 2.791 GHz.
> > > overruns: 2.
> > > max overrun time: 677 uSec.
> > >
> > > rjw-2:
> > > ave: 3.14 (+5%)
> > > min: 2.97
> > > max: 3.28
> > > ave freq: 2.635 GHz
> >
> > I guess the numbers above could be reduced still by using a P-state
> > below the max non-turbo one as a limit.
>
> Yes, and for a test I did "rjw-3".
>
> > > overruns: 1042.
> > > max overrun time: 9,769 uSec.
> >
> > This would probably get worse then, though.
>
> Yes, that was my expectation, but not what happened.
>
> rjw-3:
> ave: 3.09 watts
> min: 3.01 watts
> max: 31.7 watts

Did you mean 3.17 here?  It would be hard to get the average of 3.09
if the max was over 30 W.

> ave freq: 2.42 GHz.
> overruns: 12. (I did not expect this.)
> Max overruns time: 621 uSec.
>
> Note 1: IRQ's increased by 74%. i.e. it was going in
> and out of idle a lot more.

That's because the scheduler tick is allowed to run a lot more often
in the given workload with the changed test above.

> Note 2: We know that processor package power
> is highly temperature dependent. I forgot to let my
> coolant cool adequately after the kernel compile,
> and so had to throw out the first 4 power samples
> (20 minutes).
>
> I retested both rjw-2 and rjw-3, but shorter tests
> and got 0 overruns in both cases.
>
> > ATM I'm not quite sure why this happens, but you seem to have some
> > insight into it, so it would help if you shared it.
>
> My insight seems questionable.
>
> My thinking was that one can not decide if the pstate needs to go
> down or not based on such a localized look. The risk being that the
> higher periodic load might suffer overruns. Since my first test did exactly
> that, I violated my own "repeat all tests 3 times before reporting rule".
> Now, I am not sure what is going on.
> I will need more time to acquire traces and dig into it.

It looks like the workload's behavior depends on updating the CPU
performance scaling governor sufficiently often.

Previously, that happened through the watchdog workflow that is gone
now.  The rjw-2/3 patch is attempting to make up for that by letting
the tick run more often.

Admittedly, it is somewhat unclear to me why there are not so many
overruns in the "stock" kernel test.  Perhaps that's because the CPUs
generally run fast enough in that case, so the P-state governor need
not be updated so often for the CPU frequency to stay high and that's
what determines the performance (and in turn that decides whether or
not there are overruns and how often they occur).  The other side of
the coin is that the updates don't occur often enough for the power
draw to be reasonable, though.

Presumably, when the P-state governor gets updated more often (via the
scheduler tick), but not often enough, it effectively causes the
frequency (and consequently the performance) to drop over relatively
long time intervals (and hence the increased occurrence of overruns).
If it gets updated even more often (like in rjw-3), though, it causes
the CPU frequency to follow the utilization more precisely and so the
CPUs don't run "too fast" or "too slowly" for too long.

> I also did a 1 hour intel_pstate_tracer test, with rjw-2, on an idle system
> and saw several long durations. This was expected as this patch set
> wouldn't change durations by more than a few jiffies.
> 755 long durations (>6.1 seconds), and 327.7 seconds longest.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-02  4:06                                             ` Doug Smythies
  2022-03-02 19:00                                               ` Rafael J. Wysocki
@ 2022-03-03  5:27                                               ` Feng Tang
  2022-03-03 12:02                                                 ` Rafael J. Wysocki
  1 sibling, 1 reply; 47+ messages in thread
From: Feng Tang @ 2022-03-03  5:27 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, srinivas pandruvada, Zhang,
	Rui, Thomas Gleixner, paulmck, stable, x86, linux-pm

On Tue, Mar 01, 2022 at 08:06:24PM -0800, Doug Smythies wrote:
> On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > I guess the numbers above could be reduced still by using a P-state
> > below the max non-turbo one as a limit.
> 
> Yes, and for a test I did "rjw-3".
> 
> > > overruns: 1042.
> > > max overrun time: 9,769 uSec.
> >
> > This would probably get worse then, though.
> 
> Yes, that was my expectation, but not what happened.
> 
> rjw-3:
> ave: 3.09 watts
> min: 3.01 watts
> max: 31.7 watts
> ave freq: 2.42 GHz.
> overruns: 12. (I did not expect this.)
> Max overruns time: 621 uSec.
> 
> Note 1: IRQ's increased by 74%. i.e. it was going in
> and out of idle a lot more.
> 
> Note 2: We know that processor package power
> is highly temperature dependent. I forgot to let my
> coolant cool adequately after the kernel compile,
> and so had to throw out the first 4 power samples
> (20 minutes).
> 
> I retested both rjw-2 and rjw-3, but shorter tests
> and got 0 overruns in both cases.
 
One thought is can we consider trying the previous debug patch of
calling the util_update when entering idle (time limited).

In current code, the RT/CFS/Deadline class all have places to call
cpufreq_update_util(), the patch will make sure it is called in all
four classes, also it follows the principle of 'schedutil' of not
introducing more system cost. And surely I could be missing some
details here.

Following is a cleaner version of the patch, and the code could be
moved down to the internal loop of

	while (!need_resched()) {

	}

Which will make it get called more frequently.

---

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..e12688036725 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -258,15 +258,23 @@ static void cpuidle_idle_call(void)
  *
  * Called with polling cleared.
  */
+DEFINE_PER_CPU(u64, last_util_update_time);	/* in jiffies */
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	u64 expire;
 
 	/*
 	 * Check if we need to update blocked load
 	 */
 	nohz_run_idle_balance(cpu);
 
+	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
+	if (unlikely(time_is_before_jiffies((unsigned long)expire))) {
+		cpufreq_update_util(this_rq(), 0);
+		__this_cpu_write(last_util_update_time, get_jiffies_64());
+	}
+
 	/*
 	 * If the arch has a polling bit, we maintain an invariant:
 	 *

Thanks,
Feng

> > ATM I'm not quite sure why this happens, but you seem to have some
> > insight into it, so it would help if you shared it.
> 
> My insight seems questionable.
> 
> My thinking was that one can not decide if the pstate needs to go
> down or not based on such a localized look. The risk being that the
> higher periodic load might suffer overruns. Since my first test did exactly
> that, I violated my own "repeat all tests 3 times before reporting rule".
> Now, I am not sure what is going on.
> I will need more time to acquire traces and dig into it.
> 
> I also did a 1 hour intel_pstate_tracer test, with rjw-2, on an idle system
> and saw several long durations. This was expected as this patch set
> wouldn't change durations by more than a few jiffies.
> 755 long durations (>6.1 seconds), and 327.7 seconds longest.
> 
> ... Doug

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-03  5:27                                               ` Feng Tang
@ 2022-03-03 12:02                                                 ` Rafael J. Wysocki
  2022-03-04  5:13                                                   ` Feng Tang
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2022-03-03 12:02 UTC (permalink / raw)
  To: Feng Tang
  Cc: Doug Smythies, Rafael J. Wysocki, Rafael J. Wysocki,
	srinivas pandruvada, Zhang, Rui, Thomas Gleixner, paulmck,
	stable, x86, linux-pm

On Thu, Mar 3, 2022 at 6:27 AM Feng Tang <feng.tang@intel.com> wrote:
>
> On Tue, Mar 01, 2022 at 08:06:24PM -0800, Doug Smythies wrote:
> > On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > I guess the numbers above could be reduced still by using a P-state
> > > below the max non-turbo one as a limit.
> >
> > Yes, and for a test I did "rjw-3".
> >
> > > > overruns: 1042.
> > > > max overrun time: 9,769 uSec.
> > >
> > > This would probably get worse then, though.
> >
> > Yes, that was my expectation, but not what happened.
> >
> > rjw-3:
> > ave: 3.09 watts
> > min: 3.01 watts
> > max: 31.7 watts
> > ave freq: 2.42 GHz.
> > overruns: 12. (I did not expect this.)
> > Max overruns time: 621 uSec.
> >
> > Note 1: IRQ's increased by 74%. i.e. it was going in
> > and out of idle a lot more.
> >
> > Note 2: We know that processor package power
> > is highly temperature dependent. I forgot to let my
> > coolant cool adequately after the kernel compile,
> > and so had to throw out the first 4 power samples
> > (20 minutes).
> >
> > I retested both rjw-2 and rjw-3, but shorter tests
> > and got 0 overruns in both cases.
>
> One thought is can we consider trying the previous debug patch of
> calling the util_update when entering idle (time limited).
>
> In current code, the RT/CFS/Deadline class all have places to call
> cpufreq_update_util(), the patch will make sure it is called in all
> four classes, also it follows the principle of 'schedutil' of not
> introducing more system cost. And surely I could be missing some
> details here.
>
> Following is a cleaner version of the patch, and the code could be
> moved down to the internal loop of
>
>         while (!need_resched()) {
>
>         }
>
> Which will make it get called more frequently.

It will, but it's not necessary in all cases.  It only is necessary if
the tick is going to be stopped (because the tick will update the
P-state governor anyway if it runs).  However, at this point you don't
know what's going to happen to the tick.

Moreover, updating the P-state governor before going idle doesn't
really help, because the P-state programmed by it at this point may
very well be stale after getting out of the idle state, so instead of
doing a full update at this point, it should force a low P-state on
the way from idle.

> ---
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index d17b0a5ce6ac..e12688036725 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -258,15 +258,23 @@ static void cpuidle_idle_call(void)
>   *
>   * Called with polling cleared.
>   */
> +DEFINE_PER_CPU(u64, last_util_update_time);    /* in jiffies */
>  static void do_idle(void)
>  {
>         int cpu = smp_processor_id();
> +       u64 expire;
>
>         /*
>          * Check if we need to update blocked load
>          */
>         nohz_run_idle_balance(cpu);
>
> +       expire = __this_cpu_read(last_util_update_time) + HZ * 3;
> +       if (unlikely(time_is_before_jiffies((unsigned long)expire))) {
> +               cpufreq_update_util(this_rq(), 0);

And quite frankly I'm not sure if running cpufreq_update_util() from
here is safe.

> +               __this_cpu_write(last_util_update_time, get_jiffies_64());
> +       }
> +
>         /*
>          * If the arch has a polling bit, we maintain an invariant:
>          *
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-02 19:00                                               ` Rafael J. Wysocki
@ 2022-03-03 23:00                                                 ` Doug Smythies
  2022-03-04  6:59                                                   ` Doug Smythies
  0 siblings, 1 reply; 47+ messages in thread
From: Doug Smythies @ 2022-03-03 23:00 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Feng Tang, Rafael J. Wysocki, srinivas pandruvada, Zhang, Rui,
	Thomas Gleixner, paulmck, stable, x86, linux-pm, dsmythies

On Wed, Mar 2, 2022 at 11:01 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Mar 2, 2022 at 5:06 AM Doug Smythies <dsmythies@telus.net> wrote:
> > On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > On Tue, Mar 1, 2022 at 6:18 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > > On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > > On Tue, Mar 1, 2022 at 6:53 AM Feng Tang <feng.tang@intel.com> wrote:
> > > > > > On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
> > > > ...
> > > > > > >
> > > > > > > However, it was a bit racy, so maybe it's good that it was not complete.
> > > > > > >
> > > > > > > Below is a new version.
> > > > > >
> > > > > > Thanks for the new version. I just gave it a try,  and the occasional
> > > > > > long delay of cpufreq auto-adjusting I have seen can not be reproduced
> > > > > > after applying it.
> > > > >
> > > > > OK, thanks!
> > > > >
> > > > > I'll wait for feedback from Doug, though.
> > > >
> > > > Hi Rafael,
> > > >
> > > > Thank you for your version 2 patch.
> > > > I screwed up an overnight test and will have to re-do it.
> > > > However, I do have some results.
> > >
> > > Thanks for testing it!
> > >
> > > > From reading the patch code, one worry was the
> > > > potential to drive down the desired/required CPU
> > > > frequency for the main periodic workflow, causing
> > > > overruns, or inability of the task to complete its
> > > > work before the next period.
> > >
> > > It is not clear to me why you worried about that just from reading the
> > > patch?  Can you explain, please?
> >
> > It is already covered below. And a couple of further tests
> > directly contradict my thinking.
> >
> > > > I have always had overrun
> > > > information, but it has never been relevant before.
> > > >
> > > > The other worry was if the threshold of
> > > > turbo/not turbo frequency is enough.
> > >
> > > Agreed.
> >
> > Just as an easy example and test I did this on
> > top of this patch ("rjw-3"):
> >
> > doug@s19:~/kernel/linux$ git diff
> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > index f878a4545eee..5cbdd7e479e8 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct
> > cpudata *cpu)
> >          * P-states to prevent them from getting back to the high frequency
> >          * right away after getting out of deep idle.
> >          */
> > -       cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
> > +       cpuidle_update_retain_tick(pstate > (cpu->pstate.max_pstate >> 1));
>
> OK, but cpu->pstate.max_pstate / 2 may be almost
> cpu->pstate.min_pstate which would be better to use here IMO.

For my processor, i5-10600K, it works out to 2.05 GHz.
4.1 GHz / 2, whereas min pstate would be 0.8 GHz

> Or something like (cpu->pstate.max_pstate + cpu->pstate.min_pstate) /
> 2.  Can you try this one in particular?

Which I agree would be a much better general case to use.
For my processor that would be 2.45 GHz.
(4.1 + 0.8) / 2.
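
As a cross-check of those two thresholds, a small stand-alone calculation
(assuming the 100 MHz per pstate step these figures imply) shows where the
in-kernel integer expressions actually land, namely pstates 20 and 24,
just below the 2.05 GHz and 2.45 GHz values above:

#include <stdio.h>

int main(void)
{
	int min_pstate = 8, max_pstate = 41;	/* i5-10600K, max non-turbo */

	/* Doug's test change: max_pstate >> 1 */
	printf("max/2 threshold:       pstate %d (%.2f GHz)\n",
	       max_pstate >> 1, (max_pstate >> 1) * 0.1);

	/* Rafael's suggestion: (max_pstate + min_pstate) / 2 */
	printf("(max+min)/2 threshold: pstate %d (%.2f GHz)\n",
	       (max_pstate + min_pstate) / 2,
	       ((max_pstate + min_pstate) / 2) * 0.1);
	return 0;
}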

My code was just for testing, and I intended to go further than
we might want to for the final solution.

I'm holding off on trying your suggestion, for now.

> >         wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
> >  }
> >
> > > > I do not know how to test any final solution
> > > > thoroughly, as so far I have simply found a
> > > > good enough problematic example.
> > > > We have so many years of experience with
> > > > the convenient multi second NMI forcing
> > > > lingering high pstate clean up. I'd keep it
> > > > deciding within it if the TSC stuff needs to be
> > > > executed or not.
> > > >
> > > > Anyway...
> > > >
> > > > Base Kernel 5.17-rc3.
> > > > "stock" : unmodified.
> > > > "revert" : with commit b50db7095fe reverted
> > > > "rjw-2" : with this version2 patch added.
> > > >
> > > > Test 1 (as before. There is no test 2, yet.):
> > > > 347 Hertz work/sleep frequency on one CPU while others do
> > > > virtually no load but enough to increase the requested pstate,
> > > > but at around 0.02 hertz aggregate.
> > > >
> > > > It is important to note the main load is approximately
> > > > 38.6% @ 2.422 GHz, or 100% at 0.935 GHz.
> > > > and almost exclusively uses idle state 2 (of
> > > > 4 total idle states)
> > > >
> > > > /sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
> > > > /sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
> > > > /sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
> > > > /sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
> > > >
> > > > Turbostat was used. ~10 samples at 300 seconds per.
> > > > Processor package power (Watts):
> > > >
> > > > Workflow was run for 1 hour each time or 1249201 loops.
> > > >
> > > > revert:
> > > > ave: 3.00
> > > > min: 2.89
> > > > max: 3.08
> > >
> > > I'm not sure what the above three numbers are.
> >
> > Processor package power, Watts.
>
> OK
>
> > > > ave freq: 2.422 GHz.
> > > > overruns: 1.
> > > > max overrun time: 113 uSec.
> > > >
> > > > stock:
> > > > ave: 3.63 (+21%)
> > > > min: 3.28
> > > > max: 3.99
> > > > ave freq: 2.791 GHz.
> > > > overruns: 2.
> > > > max overrun time: 677 uSec.
> > > >
> > > > rjw-2:
> > > > ave: 3.14 (+5%)
> > > > min: 2.97
> > > > max: 3.28
> > > > ave freq: 2.635 GHz
> > >
> > > I guess the numbers above could be reduced still by using a P-state
> > > below the max non-turbo one as a limit.
> >
> > Yes, and for a test I did "rjw-3".
> >
> > > > overruns: 1042.
> > > > max overrun time: 9,769 uSec.
> > >
> > > This would probably get worse then, though.
> >
> > Yes, that was my expectation, but not what happened.
> >
> > rjw-3:
> > ave: 3.09 watts
> > min: 3.01 watts
> > max: 31.7 watts
>
> Did you mean 3.17 here?  It would be hard to get the average of 3.09
> if the max was over 30 W.

Yes, 3.17 watts was what I meant to write. Sorry for any confusion.

> > ave freq: 2.42 GHz.
> > overruns: 12. (I did not expect this.)
> > Max overruns time: 621 uSec.
> >
> > Note 1: IRQ's increased by 74%. i.e. it was going in
> > and out of idle a lot more.
>
> That's because the scheduler tick is allowed to run a lot more often
> in the given workload with the changed test above.

Agreed.

> > Note 2: We know that processor package power
> > is highly temperature dependent. I forgot to let my
> > coolant cool adequately after the kernel compile,
> > and so had to throw out the first 4 power samples
> > (20 minutes).
> >
> > I retested both rjw-2 and rjw-3, but shorter tests
> > and got 0 overruns in both cases.
> >
> > > ATM I'm not quite sure why this happens, but you seem to have some
> > > insight into it, so it would help if you shared it.
> >
> > My insight seems questionable.
> >
> > My thinking was that one can not decide if the pstate needs to go
> > down or not based on such a localized look. The risk being that the
> > higher periodic load might suffer overruns. Since my first test did exactly
> > that, I violated my own "repeat all tests 3 times before reporting rule".
> > Now, I am not sure what is going on.
> > I will need more time to acquire traces and dig into it.
>
> It looks like the workload's behavior depends on updating the CPU
> performance scaling governor sufficiently often.
>
> Previously, that happened through the watchdog workflow that is gone
> now.  The rjw-2/3 patch is attempting to make up for that by letting
> the tick run more often.
>
> Admittedly, it is somewhat unclear to me why there are not so many
> overruns in the "stock" kernel test.  Perhaps that's because the CPUs
> generally run fast enough in that case, so the P-state governor need
> not be updated so often for the CPU frequency to stay high and that's
> what determines the performance (and in turn that decides whether or
> not there are overruns and how often they occur).  The other side of
> the coin is that the updates don't occur often enough for the power
> draw to be reasonable, though.

Agreed.

I am observing higher CPU frequencies than expected, and can not
determine why in all cases. It is going to take several days before
I can either reply with more detail or disprove what I think I am observing.

> Presumably, when the P-state governor gets updated more often (via the
> scheduler tick), but not often enough, it effectively causes the
> frequency (and consequently the performance) to drop over relatively
> long time intervals (and hence the increased occurrence of overruns).
> If it gets updated even more often (like in rjw-3), though, it causes
> the CPU frequency to follow the utilization more precisely and so the
> CPUs don't run "too fast" or "too slowly" for too long.

Agreed.

> > I also did a 1 hour intel_pstate_tracer test, with rjw-2, on an idle system
> > and saw several long durations. This was expected as this patch set
> > wouldn't change durations by more than a few jiffies.
> > 755 long durations (>6.1 seconds), and 327.7 seconds longest.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-03 12:02                                                 ` Rafael J. Wysocki
@ 2022-03-04  5:13                                                   ` Feng Tang
  2022-03-04 16:23                                                     ` Paul E. McKenney
  0 siblings, 1 reply; 47+ messages in thread
From: Feng Tang @ 2022-03-04  5:13 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, Rafael J. Wysocki, srinivas pandruvada, Zhang,
	Rui, Thomas Gleixner, paulmck, stable, x86, linux-pm, tim.c.chen

On Thu, Mar 03, 2022 at 01:02:01PM +0100, Rafael J. Wysocki wrote:
> On Thu, Mar 3, 2022 at 6:27 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > On Tue, Mar 01, 2022 at 08:06:24PM -0800, Doug Smythies wrote:
> > > On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > >
> > > > I guess the numbers above could be reduced still by using a P-state
> > > > below the max non-turbo one as a limit.
> > >
> > > Yes, and for a test I did "rjw-3".
> > >
> > > > > overruns: 1042.
> > > > > max overrun time: 9,769 uSec.
> > > >
> > > > This would probably get worse then, though.
> > >
> > > Yes, that was my expectation, but not what happened.
> > >
> > > rjw-3:
> > > ave: 3.09 watts
> > > min: 3.01 watts
> > > max: 31.7 watts
> > > ave freq: 2.42 GHz.
> > > overruns: 12. (I did not expect this.)
> > > Max overruns time: 621 uSec.
> > >
> > > Note 1: IRQ's increased by 74%. i.e. it was going in
> > > and out of idle a lot more.
> > >
> > > Note 2: We know that processor package power
> > > is highly temperature dependent. I forgot to let my
> > > coolant cool adequately after the kernel compile,
> > > and so had to throw out the first 4 power samples
> > > (20 minutes).
> > >
> > > I retested both rjw-2 and rjw-3, but shorter tests
> > > and got 0 overruns in both cases.
> >
> > One thought is can we consider trying the previous debug patch of
> > calling the util_update when entering idle (time limited).
> >
> > In current code, the RT/CFS/Deadline class all have places to call
> > cpufreq_update_util(), the patch will make sure it is called in all
> > four classes, also it follows the principle of 'schedutil' of not
> > introducing more system cost. And surely I could be missing some
> > details here.
> >
> > Following is a cleaner version of the patch, and the code could be
> > moved down to the internal loop of
> >
> >         while (!need_resched()) {
> >
> >         }
> >
> > Which will make it get called more frequently.
> 
> It will, but it's not necessary in all cases.  It only is necessary if
> the tick is going to be stopped (because the tick will update the
> P-state governor anyway if it runs).  However, at this point you don't
> know what's going to happen to the tick.
> 
> Moreover, updating the P-state governor before going idle doesn't
> really help,

From Doug's previous test, the power consumption and the delay
both improved with the debug patch.

> because the P-state programmed by it at this point may
> very well be stale after getting out of the idle state, so instead of
> doing a full update at this point, it should force a low P-state on
> the way from idle.

Makes sense.

Paul has asked about the timer interrupts, and here is some more info:
when there is no active load in the system, the longest interval of 4
seconds between 2 timer interrupts comes from the kernel watchdog for
the hardware/software lockup detector. Every CPU has this timer, which
limits the maximum CPU idle duration to 4 seconds.
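
For what it is worth, that 4 second figure is consistent with the
sample_period calculation in kernel/watchdog.c as read here, where the
soft-lockup hrtimer period works out to watchdog_thresh * 2 / 5 seconds:

#include <stdio.h>

int main(void)
{
	int watchdog_thresh = 10;	/* default threshold, in seconds */

	/*
	 * kernel/watchdog.c (as read here): period = softlockup_thresh / 5,
	 * with softlockup_thresh = watchdog_thresh * 2.
	 */
	printf("per-CPU watchdog timer period: %d s\n",
	       watchdog_thresh * 2 / 5);
	return 0;
}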

> > ---
> >
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index d17b0a5ce6ac..e12688036725 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -258,15 +258,23 @@ static void cpuidle_idle_call(void)
> >   *
> >   * Called with polling cleared.
> >   */
> > +DEFINE_PER_CPU(u64, last_util_update_time);    /* in jiffies */
> >  static void do_idle(void)
> >  {
> >         int cpu = smp_processor_id();
> > +       u64 expire;
> >
> >         /*
> >          * Check if we need to update blocked load
> >          */
> >         nohz_run_idle_balance(cpu);
> >
> > +       expire = __this_cpu_read(last_util_update_time) + HZ * 3;
> > +       if (unlikely(time_is_before_jiffies((unsigned long)expire))) {
> > +               cpufreq_update_util(this_rq(), 0);
> 
> And quite frankly I'm not sure if running cpufreq_update_util() from
> here is safe.

I had that concern too :). Do you mean this is called when the local
irq is enabled, and could be interrupted causing some issue?

Thanks,
Feng

> > +               __this_cpu_write(last_util_update_time, get_jiffies_64());
> > +       }
> > +
> >         /*
> >          * If the arch has a polling bit, we maintain an invariant:
> >          *
> >

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-03 23:00                                                 ` Doug Smythies
@ 2022-03-04  6:59                                                   ` Doug Smythies
  2022-03-16 15:54                                                     ` Doug Smythies
  0 siblings, 1 reply; 47+ messages in thread
From: Doug Smythies @ 2022-03-04  6:59 UTC (permalink / raw)
  To: Rafael J. Wysocki, srinivas pandruvada, Feng Tang
  Cc: Rafael J. Wysocki, Zhang, Rui, Thomas Gleixner, paulmck, stable,
	x86, linux-pm, dsmythies

On Thu, Mar 3, 2022 at 3:00 PM Doug Smythies <dsmythies@telus.net> wrote:
> On Wed, Mar 2, 2022 at 11:01 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Wed, Mar 2, 2022 at 5:06 AM Doug Smythies <dsmythies@telus.net> wrote:
> > > On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > On Tue, Mar 1, 2022 at 6:18 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > > > On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > > > On Tue, Mar 1, 2022 at 6:53 AM Feng Tang <feng.tang@intel.com> wrote:
> > > > > > > On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
> > > > > ...
> > > > > > > >
> > > > > > > > However, it was a bit racy, so maybe it's good that it was not complete.
> > > > > > > >
> > > > > > > > Below is a new version.
> > > > > > >
> > > > > > > Thanks for the new version. I just gave it a try,  and the occasional
> > > > > > > long delay of cpufreq auto-adjusting I have seen can not be reproduced
> > > > > > > after applying it.
> > > > > >
> > > > > > OK, thanks!
> > > > > >
> > > > > > I'll wait for feedback from Doug, though.
> > > > >
> > > > > Hi Rafael,
> > > > >
> > > > > Thank you for your version 2 patch.
> > > > > I screwed up an overnight test and will have to re-do it.
> > > > > However, I do have some results.
> > > >
> > > > Thanks for testing it!
> > > >
> > > > > From reading the patch code, one worry was the
> > > > > potential to drive down the desired/required CPU
> > > > > frequency for the main periodic workflow, causing
> > > > > overruns, or inability of the task to complete its
> > > > > work before the next period.
> > > >
> > > > It is not clear to me why you worried about that just from reading the
> > > > patch?  Can you explain, please?
> > >
> > > It is already covered below. And a couple of further tests
> > > directly contradict my thinking.
> > >
> > > > > I have always had overrun
> > > > > information, but it has never been relevant before.
> > > > >
> > > > > The other worry was if the threshold of
> > > > > turbo/not turbo frequency is enough.
> > > >
> > > > Agreed.
> > >
> > > Just as an easy example and test I did this on
> > > top of this patch ("rjw-3"):
> > >
> > > doug@s19:~/kernel/linux$ git diff
> > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > > index f878a4545eee..5cbdd7e479e8 100644
> > > --- a/drivers/cpufreq/intel_pstate.c
> > > +++ b/drivers/cpufreq/intel_pstate.c
> > > @@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct
> > > cpudata *cpu)
> > >          * P-states to prevent them from getting back to the high frequency
> > >          * right away after getting out of deep idle.
> > >          */
> > > -       cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
> > > +       cpuidle_update_retain_tick(pstate > (cpu->pstate.max_pstate >> 1));
> >
> > OK, but cpu->pstate.max_pstate / 2 may be almost
> > cpu->pstate.min_pstate which would be better to use here IMO.
>
> For my processor, i5-10600K, it works out to 2.05 GHz.
> 4.1 GHz / 2, whereas min pstate would be 0.8 GHz
>
> > Or something like (cpu->pstate.max_pstate + cpu->pstate.min_pstate) /
> > 2.  Can you try this one in particular?
>
> Which I agree would be a much better general case to use.
> For my processor that would be 2.45 GHz.
> (4.1 + 0.8) / 2.
>
> My code was just for testing, and I intended to go further than
> we might want to for the final solution.
>
> I'm holding off on trying your suggestion, for now.
>
> > >         wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
> > >  }
> > >
> > > > > I do not know how to test any final solution
> > > > > thoroughly, as so far I have simply found a
> > > > > good enough problematic example.
> > > > > We have so many years of experience with
> > > > > the convenient multi second NMI forcing
> > > > > lingering high pstate clean up. I'd keep it
> > > > > deciding within it if the TSC stuff needs to be
> > > > > executed or not.
> > > > >
> > > > > Anyway...
> > > > >
> > > > > Base Kernel 5.17-rc3.
> > > > > "stock" : unmodified.
> > > > > "revert" : with commit b50db7095fe reverted
> > > > > "rjw-2" : with this version2 patch added.
> > > > >
> > > > > Test 1 (as before. There is no test 2, yet.):
> > > > > 347 Hertz work/sleep frequency on one CPU while others do
> > > > > virtually no load but enough to increase the requested pstate,
> > > > > but at around 0.02 hertz aggregate.
> > > > >
> > > > > It is important to note the main load is approximately
> > > > > 38.6% @ 2.422 GHz, or 100% at 0.935 GHz.
> > > > > and almost exclusively uses idle state 2 (of
> > > > > 4 total idle states)
> > > > >
> > > > > /sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
> > > > > /sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
> > > > > /sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
> > > > > /sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
> > > > >
> > > > > Turbostat was used. ~10 samples at 300 seconds per.
> > > > > Processor package power (Watts):
> > > > >
> > > > > Workflow was run for 1 hour each time or 1249201 loops.
> > > > >
> > > > > revert:
> > > > > ave: 3.00
> > > > > min: 2.89
> > > > > max: 3.08
> > > >
> > > > I'm not sure what the above three numbers are.
> > >
> > > Processor package power, Watts.
> >
> > OK
> >
> > > > > ave freq: 2.422 GHz.
> > > > > overruns: 1.
> > > > > max overrun time: 113 uSec.
> > > > >
> > > > > stock:
> > > > > ave: 3.63 (+21%)
> > > > > min: 3.28
> > > > > max: 3.99
> > > > > ave freq: 2.791 GHz.
> > > > > overruns: 2.
> > > > > max overrun time: 677 uSec.
> > > > >
> > > > > rjw-2:
> > > > > ave: 3.14 (+5%)
> > > > > min: 2.97
> > > > > max: 3.28
> > > > > ave freq: 2.635 GHz
> > > >
> > > > I guess the numbers above could be reduced still by using a P-state
> > > > below the max non-turbo one as a limit.
> > >
> > > Yes, and for a test I did "rjw-3".
> > >
> > > > > overruns: 1042.
> > > > > max overrun time: 9,769 uSec.
> > > >
> > > > This would probably get worse then, though.
> > >
> > > Yes, that was my expectation, but not what happened.
> > >
> > > rjw-3:
> > > ave: 3.09 watts
> > > min: 3.01 watts
> > > max: 31.7 watts
> >
> > Did you mean 3.17 here?  It would be hard to get the average of 3.09
> > if the max was over 30 W.
>
> Yes, 3.17 watts was what I meant to write. Sorry for any confusion.
>
> > > ave freq: 2.42 GHz.
> > > overruns: 12. (I did not expect this.)
> > > Max overruns time: 621 uSec.
> > >
> > > Note 1: IRQ's increased by 74%. i.e. it was going in
> > > and out of idle a lot more.
> >
> > That's because the scheduler tick is allowed to run a lot more often
> > in the given workload with the changed test above.
>
> Agreed.
>
> > > Note 2: We know that processor package power
> > > is highly temperature dependent. I forgot to let my
> > > coolant cool adequately after the kernel compile,
> > > and so had to throw out the first 4 power samples
> > > (20 minutes).
> > >
> > > I retested both rjw-2 and rjw-3, but shorter tests
> > > and got 0 overruns in both cases.
> > >
> > > > ATM I'm not quite sure why this happens, but you seem to have some
> > > > insight into it, so it would help if you shared it.
> > >
> > > My insight seems questionable.
> > >
> > > My thinking was that one can not decide if the pstate needs to go
> > > down or not based on such a localized look. The risk being that the
> > > higher periodic load might suffer overruns. Since my first test did exactly
> > > that, I violated my own "repeat all tests 3 times before reporting rule".
> > > Now, I am not sure what is going on.
> > > I will need more time to acquire traces and dig into it.
> >
> > It looks like the workload's behavior depends on updating the CPU
> > performance scaling governor sufficiently often.
> >
> > Previously, that happened through the watchdog workflow that is gone
> > now.  The rjw-2/3 patch is attempting to make up for that by letting
> > the tick run more often.
> >
> > Admittedly, it is somewhat unclear to me why there are not so many
> > overruns in the "stock" kernel test.

The CPU frequency is incorrectly held considerably higher than it
should be. From intel_pstate_tracer data it looks as though
a CPU that is idle is not giving up its vote into the PLL as to
what the CPU frequency should be. However, the culprit isn't
actually that CPU, but rather the uncore, which seems to lock
to that same CPU's pstate. It doesn't unlock from that state
until that CPU calls the intel_pstate driver, sometimes tens of
seconds later. Then things will proceed as normal for a short time
until that same CPU does the same thing and the cycle repeats.

If I lock the uncore at its minimum pstate, then everything is fine
and I can not detect any difference between the "stock" and
"reverted" kernels. Both give better power results than
all previous tests. Processor package power is
approximately 2.75 watts (it is late in my time zone, exact numbers
at a later date).  No overruns.
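
For anyone wanting to try the same pinning, a rough user-space sketch
follows. It assumes the sysfs interface of the intel-uncore-frequency
driver, which may or may not cover this particular desktop part and is
not necessarily how the test above was done; the package/die directory
name is also an assumption:

#include <stdio.h>

/* Assumed sysfs location; adjust for the package/die on the target box. */
#define UNCORE_DIR "/sys/devices/system/cpu/intel_uncore_frequency/package_00_die_00"

static long read_khz(const char *attr)
{
	char path[256];
	long khz = -1;
	FILE *f;

	snprintf(path, sizeof(path), UNCORE_DIR "/%s", attr);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%ld", &khz) != 1)
			khz = -1;
		fclose(f);
	}
	return khz;
}

static int write_khz(const char *attr, long khz)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), UNCORE_DIR "/%s", attr);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%ld\n", khz);
	return fclose(f);
}

int main(void)
{
	/* Pin the uncore by clamping both limits to the hardware minimum. */
	long floor = read_khz("initial_min_freq_khz");

	if (floor < 0)
		return 1;	/* driver not present, or no permission */
	write_khz("max_freq_khz", floor);
	write_khz("min_freq_khz", floor);
	return 0;
}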

The shorter maximum durations from the "reverted" kernel merely
cleaned up the mess faster than the "stock" kernel, resulting in
less, but still excessive, power consumption.

I only heard about uncore a couple of months ago from Srinivas [1]

[1] https://lore.kernel.org/linux-pm/1b2be990d5c31f62d9ce33aa2eb2530708d5607a.camel@linux.intel.com/

The CPU that the uncore appears to lock to is not the same for
each run of the test, but I know which CPU it was for that run of
the test just by looking at the load graphs from the intel_pstate_tracer
output. Thereafter, manual searching of the data reveals the anomalies.

... Doug

> Perhaps that's because the CPUs
> > generally run fast enough in that case, so the P-state governor need
> > not be updated so often for the CPU frequency to stay high and that's
> > what determines the performance (and in turn that decides whether or
> > not there are overruns and how often they occur).  The other side of
> > the coin is that the updates don't occur often enough for the power
> > draw to be reasonable, though.
>
> Agreed.
>
> I am observing higher CPU frequencies than expected, and can not
> determine why in all cases. It is going to take several days before
> I can either reply with more detail or disprove what I think I am observing.
>
> > Presumably, when the P-state governor gets updated more often (via the
> > scheduler tick), but not often enough, it effectively causes the
> > frequency (and consequently the performance) to drop over relatively
> > long time intervals (and hence the increased occurrence of overruns).
> > If it gets updated even more often (like in rjw-3), though, it causes
> > the CPU frequency to follow the utilization more precisely and so the
> > CPUs don't run "too fast" or "too slowly" for too long.
>
> Agreed.
>
> > > I also did a 1 hour intel_pstate_tracer test, with rjw-2, on an idle system
> > > and saw several long durations. This was expected as this patch set
> > > wouldn't change durations by more than a few jiffies.
> > > 755 long durations (>6.1 seconds), and 327.7 seconds longest.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-04  5:13                                                   ` Feng Tang
@ 2022-03-04 16:23                                                     ` Paul E. McKenney
  0 siblings, 0 replies; 47+ messages in thread
From: Paul E. McKenney @ 2022-03-04 16:23 UTC (permalink / raw)
  To: Feng Tang
  Cc: Rafael J. Wysocki, Doug Smythies, Rafael J. Wysocki,
	srinivas pandruvada, Zhang, Rui, Thomas Gleixner, stable, x86,
	linux-pm, tim.c.chen

On Fri, Mar 04, 2022 at 01:13:44PM +0800, Feng Tang wrote:
> On Thu, Mar 03, 2022 at 01:02:01PM +0100, Rafael J. Wysocki wrote:
> > On Thu, Mar 3, 2022 at 6:27 AM Feng Tang <feng.tang@intel.com> wrote:
> > >
> > > On Tue, Mar 01, 2022 at 08:06:24PM -0800, Doug Smythies wrote:
> > > > On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > >
> > > > > I guess the numbers above could be reduced still by using a P-state
> > > > > below the max non-turbo one as a limit.
> > > >
> > > > Yes, and for a test I did "rjw-3".
> > > >
> > > > > > overruns: 1042.
> > > > > > max overrun time: 9,769 uSec.
> > > > >
> > > > > This would probably get worse then, though.
> > > >
> > > > Yes, that was my expectation, but not what happened.
> > > >
> > > > rjw-3:
> > > > ave: 3.09 watts
> > > > min: 3.01 watts
> > > > max: 31.7 watts
> > > > ave freq: 2.42 GHz.
> > > > overruns: 12. (I did not expect this.)
> > > > Max overruns time: 621 uSec.
> > > >
> > > > Note 1: IRQ's increased by 74%. i.e. it was going in
> > > > and out of idle a lot more.
> > > >
> > > > Note 2: We know that processor package power
> > > > is highly temperature dependent. I forgot to let my
> > > > coolant cool adequately after the kernel compile,
> > > > and so had to throw out the first 4 power samples
> > > > (20 minutes).
> > > >
> > > > I retested both rjw-2 and rjw-3, but shorter tests
> > > > and got 0 overruns in both cases.
> > >
> > > One thought is can we consider trying the previous debug patch of
> > > calling the util_update when entering idle (time limited).
> > >
> > > In current code, the RT/CFS/Deadline class all have places to call
> > > cpufreq_update_util(), the patch will make sure it is called in all
> > > four classes, also it follows the principle of 'schedutil' of not
> > > introducing more system cost. And surely I could be missing some
> > > details here.
> > >
> > > Following is a cleaner version of the patch, and the code could be
> > > moved down to the internal loop of
> > >
> > >         while (!need_resched()) {
> > >
> > >         }
> > >
> > > Which will make it get called more frequently.
> > 
> > It will, but it's not necessary in all cases.  It only is necessary if
> > the tick is going to be stopped (because the tick will update the
> > P-state governor anyway if it runs).  However, at this point you don't
> > know what's going to happen to the tick.
> > 
> > Moreover, updating the P-state governor before going idle doesn't
> > really help,
> 
> From Doug's previous test, the power consumption and the delay
> both improved with the debug patch.
> 
> > because the P-state programmed by it at this point may
> > very well be stale after getting out of the idle state, so instead of
> > doing a full update at this point, it should force a low P-state on
> > the way from idle.
> 
> Makes sense.
> 
> Paul has asked about the timer interrupts, and here is some more info:
> when there is no active load in the system, the longest interval of 4
> seconds between 2 timer interrupts comes from the kernel watchdog for
> the hardware/software lockup detector. Every CPU has this timer, which
> limits the maximum CPU idle duration to 4 seconds.

And thank you for the info!  One way to reduce this overhead is to boot
with the watchdog_thresh kernel-boot parameter set to some value greater
than 10.  The downside is that HW/FW/NMI catastrophes will need to last
for more than 10 seconds before the system complains.  One approach
is to run with a small watchdog_thresh during testing and a larger one
in production.

							Thanx, Paul

> > > ---
> > >
> > > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > > index d17b0a5ce6ac..e12688036725 100644
> > > --- a/kernel/sched/idle.c
> > > +++ b/kernel/sched/idle.c
> > > @@ -258,15 +258,23 @@ static void cpuidle_idle_call(void)
> > >   *
> > >   * Called with polling cleared.
> > >   */
> > > +DEFINE_PER_CPU(u64, last_util_update_time);    /* in jiffies */
> > >  static void do_idle(void)
> > >  {
> > >         int cpu = smp_processor_id();
> > > +       u64 expire;
> > >
> > >         /*
> > >          * Check if we need to update blocked load
> > >          */
> > >         nohz_run_idle_balance(cpu);
> > >
> > > +       expire = __this_cpu_read(last_util_update_time) + HZ * 3;
> > > +       if (unlikely(time_is_before_jiffies((unsigned long)expire))) {
> > > +               cpufreq_update_util(this_rq(), 0);
> > 
> > And quite frankly I'm not sure if running cpufreq_update_util() from
> > here is safe.
> 
> I had that concern too :). Do you mean this is called when the local
> irq is enabled, and could be interrupted causing some issue?
> 
> Thanks,
> Feng
> 
> > > +               __this_cpu_write(last_util_update_time, get_jiffies_64());
> > > +       }
> > > +
> > >         /*
> > >          * If the arch has a polling bit, we maintain an invariant:
> > >          *
> > >

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-04  6:59                                                   ` Doug Smythies
@ 2022-03-16 15:54                                                     ` Doug Smythies
  2022-03-17 12:30                                                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: Doug Smythies @ 2022-03-16 15:54 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Zhang, Rui, Thomas Gleixner, paulmck, stable,
	x86, linux-pm, dsmythies, srinivas pandruvada, Feng Tang

Readers: So that graphs and large attachments could be used, I have
been on an off-list branch of this thread with Srinivas, and copied a
couple of others. While now returning to this on-list thread, I'll
only take up Rafael's proposed patch.

Hi Rafael,

So far all work has been done with: HWP disabled; intel_pstate; powersave.
The reason was that it is, by far, the best way to obtain good trace data
using the intel_pstate_tracer.py utility.

I always intended to try/test: HWP disabled; intel_cpufreq; schedutil.
There is an issue with the proposed patch and schedutil.

If any CPU ever requests a pstate > the max non turbo pstate
then it will stay at that request forever. Ultimately the idle
power goes to about 5.7 watts (versus 1.4 watts expected).
IRQs go very high, as the tick never turns off.
Actually, one knows how many CPUs are stuck requesting a high
pstate just by looking at IRQs.

Trace is useless because it virtually never gets called.
So I have been reading the IA32_PERF_CTL MSR
directly.

Example:

Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
6 cores, 12 CPUs
min pstate 8
max non-turbo pstate 41
max turbo pstate 48
The system is idle.

doug@s19:~$ sudo
/home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary
--quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt --interval 10
Busy%   Bzy_MHz IRQ     PkgWatt
0.11    800     844     1.33
0.01    800     231     1.33
0.11    800     723     1.33 <<< Powersave governor
0.03    889     440     1.33
0.17    4418    21511   4.31 <<< Schedutil governor
0.12    4101    30153   4.48 <<< 3 CPUs are > pstate 41
0.22    4347    34226   4.75
0.17    4101    43554   4.78
0.29    4300    50565   4.94
0.21    4098    50297   4.76 <<< 5 CPUs are > pstate 41
0.29    4298    50532   4.84
0.20    4101    50126   4.63
0.20    4101    50149   4.62
0.29    4297    50623   4.76
0.20    4101    50203   4.72
0.29    4295    50642   4.78
0.20    4101    50223   4.68
0.29    4292    50597   4.88
0.20    4101    50208   4.73
0.29    4296    50519   4.84
0.20    4101    50167   4.80
0.20    4101    50242   4.76
0.29    4302    50625   4.94
0.20    4101    50233   4.73
0.29    4296    50613   4.78
0.20    4101    50231   4.70
0.29    4292    50802   4.93
1.46    4669    65610   8.36
0.41    4225    80701   5.48
0.33    4101    80219   5.36 <<< 8 CPUs are > pstate 41
0.34    4098    80313   5.38
0.41    4228    80689   5.56
0.33    4101    80252   5.46

And the related MSR reads:

3 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  30 :   8 :   8 :  48 :
48 :  48 :   8 :  30 :  31 :   8 :   8 :   8 :

5 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  44 :  30 :  31 :  48 :
48 :  48 :   8 :   8 :   8 :   8 :  48 :   8 :

8 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  45 :  48 :  48 :  48 :
48 :  48 :   8 :  30 :   8 :   8 :  48 :  42 :
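
The msr-decoder above is a local tool; as a stand-in sketch only (an
assumption about the method, not its actual source), the same per-CPU
requests can be read through the msr driver, with the requested ratio in
bits 15:8 of IA32_PERF_CTL (0x199):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	for (int cpu = 0; cpu < 12; cpu++) {	/* 12 CPUs on the i5-10600K */
		char path[64];
		uint64_t val;
		int fd;

		snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
		fd = open(path, O_RDONLY);
		if (fd < 0)
			continue;	/* needs root and the msr module loaded */
		if (pread(fd, &val, sizeof(val), 0x199) == (ssize_t)sizeof(val))
			printf("CPU %2d: requested pstate %u\n",
			       cpu, (unsigned int)((val >> 8) & 0xff));
		close(fd);
	}
	return 0;
}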

This issue is independent of the original patch or the suggested modification:

> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index f878a4545eee..94018ac0b59b 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct
> cpudata *cpu)
>          * P-states to prevent them from getting back to the high frequency
>          * right away after getting out of deep idle.
>          */
> -       cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
> +       cpuidle_update_retain_tick(pstate > ((cpu->pstate.max_pstate +
> cpu->pstate.min_pstate)/2));
>         wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
>  }

... Doug

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-16 15:54                                                     ` Doug Smythies
@ 2022-03-17 12:30                                                       ` Rafael J. Wysocki
  2022-03-17 13:58                                                         ` Doug Smythies
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2022-03-17 12:30 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Zhang, Rui,
	Thomas Gleixner, paulmck, stable, x86, linux-pm,
	srinivas pandruvada, Feng Tang

On Wed, Mar 16, 2022 at 4:55 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> Readers: So that graphs and large attachments could be used, I have
> been on an off-list branch of this thread with Srinivas, and copied a
> couple of others. While now returning to this on-list thread, I'll
> only take up Rafael's proposed patch.
>
> Hi Rafael,
>
> So far all work has been done with: HWP disabled; intel_pstate; powersave.
> The reason was that it is, by far, the best way to obtain good trace data
> using the intel_pstate_tracer.py utility.
>
> I always intended to try/test: HWP disabled; intel_cpufreq; schedutil.
> There is an issue with the proposed patch and schedutil.
>
> If any CPU ever requests a pstate > the max non turbo pstate
> then it will stay at that request forever. Ultimately the idle
> power goes to about 5.7 watts (versus 1.4 watts expected).
> IRQs go very high, as the tick never turns off.
> Actually, one knows how many CPUs are stuck requesting a high
> pstate just by looking at IRQs.

That may be because INTEL_CPUFREQ_TRANSITION_DELAY is too small.

Please try to increase
/sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us to 10000 and
see what difference this makes.
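
For completeness, a minimal sketch of applying that setting from a small
program rather than the usual shell redirection (path exactly as above;
requires root):

#include <stdio.h>

#define RATE_LIMIT_PATH \
        "/sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us"

int main(void)
{
        FILE *f = fopen(RATE_LIMIT_PATH, "w"); /* needs root */

        if (!f)
                return 1;
        fprintf(f, "10000\n");
        return fclose(f) ? 1 : 0;
}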

> Trace is useless because it virtually never gets called.
> So I have been reading the IA32_PERF_CTL MSR
> directly.
>
> Example:
>
> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> 6 cores, 12 CPUs
> min pstate 8
> max non-turbo pstate 41
> max turbo pstate 48
> The system is idle.
>
> doug@s19:~$ sudo
> /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary
> --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt --interval 10
> Busy%   Bzy_MHz IRQ     PkgWatt
> 0.11    800     844     1.33
> 0.01    800     231     1.33
> 0.11    800     723     1.33 <<< Powersave governor
> 0.03    889     440     1.33
> 0.17    4418    21511   4.31 <<< Schedutil governor
> 0.12    4101    30153   4.48 <<< 3 CPUs are > pstate 41
> 0.22    4347    34226   4.75
> 0.17    4101    43554   4.78
> 0.29    4300    50565   4.94
> 0.21    4098    50297   4.76 <<< 5 CPUs are > pstate 41
> 0.29    4298    50532   4.84
> 0.20    4101    50126   4.63
> 0.20    4101    50149   4.62
> 0.29    4297    50623   4.76
> 0.20    4101    50203   4.72
> 0.29    4295    50642   4.78
> 0.20    4101    50223   4.68
> 0.29    4292    50597   4.88
> 0.20    4101    50208   4.73
> 0.29    4296    50519   4.84
> 0.20    4101    50167   4.80
> 0.20    4101    50242   4.76
> 0.29    4302    50625   4.94
> 0.20    4101    50233   4.73
> 0.29    4296    50613   4.78
> 0.20    4101    50231   4.70
> 0.29    4292    50802   4.93
> 1.46    4669    65610   8.36
> 0.41    4225    80701   5.48
> 0.33    4101    80219   5.36 <<< 8 CPUs are > pstate 41
> 0.34    4098    80313   5.38
> 0.41    4228    80689   5.56
> 0.33    4101    80252   5.46
>
> And the related MSR reads:
>
> 3 CPUs are > pstate 41:
> root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  30 :   8 :   8 :  48 :
> 48 :  48 :   8 :  30 :  31 :   8 :   8 :   8 :
>
> 5 CPUs are > pstate 41:
> root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  44 :  30 :  31 :  48 :
> 48 :  48 :   8 :   8 :   8 :   8 :  48 :   8 :
>
> 8 CPUs are > pstate 41:
> root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  45 :  48 :  48 :  48 :
> 48 :  48 :   8 :  30 :   8 :   8 :  48 :  42 :
>
> This issue is independent of the original patch or the suggested modification:
>
> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > index f878a4545eee..94018ac0b59b 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct
> > cpudata *cpu)
> >          * P-states to prevent them from getting back to the high frequency
> >          * right away after getting out of deep idle.
> >          */
> > -       cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
> > +       cpuidle_update_retain_tick(pstate > ((cpu->pstate.max_pstate +
> > cpu->pstate.min_pstate)/2));
> >         wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
> >  }
>
> ... Doug

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-17 12:30                                                       ` Rafael J. Wysocki
@ 2022-03-17 13:58                                                         ` Doug Smythies
  2022-03-24 14:04                                                           ` Doug Smythies
  0 siblings, 1 reply; 47+ messages in thread
From: Doug Smythies @ 2022-03-17 13:58 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Zhang, Rui, Thomas Gleixner, paulmck, stable,
	x86, linux-pm, srinivas pandruvada, Feng Tang, dsmythies

On Thu, Mar 17, 2022 at 5:30 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Mar 16, 2022 at 4:55 PM Doug Smythies <dsmythies@telus.net> wrote:
> >
> > Readers: So that graphs and large attachments could be used, I have
> > been on an off-list branch of this thread with Srinivas, and copied a
> > couple of others. While now returning to this on-list thread, I'll
> > only take up Rafael's proposed patch.
> >
> > Hi Rafael,
> >
> > So far all work has been done with: HWP disabled; intel_pstate; powersave.
> > The reason was that it is, by far, the best way to obtain good trace data
> > using the intel_pstate_tracer.py utility.
> >
> > I always intended to try/test: HWP disabled; intel_cpufreq; schedutil.
> > There is an issue with the proposed patch and schedutil.
> >
> > If any CPU ever requests a pstate > the max non turbo pstate
> > then it will stay at that request forever. Ultimately the idle
> > power goes to about 5.7 watts (versus 1.4 watts expected).
> > IRQs go very high, as the tick never turns off.
> > Actually, one knows how many CPUs are stuck requesting a high
> > pstate just by looking at IRQs.
>
> That may be because INTEL_CPUFREQ_TRANSITION_DELAY is too small.
>
> Please try to increase
> /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us to 10000 and
> see what difference this makes.

Changing rate_limit_us to 10000, or even 20000, makes no difference.

See a slight clarification to yesterday's email in-line below.

> > Trace is useless because it virtually never gets called.
> > So I have been reading the IA32_PERF_CTL MSR
> > directly.
> >
> > Example:
> >
> > Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> > 6 cores, 12 CPUs
> > min pstate 8
> > max non-turbo pstate 41
> > max turbo pstate 48
> > The system is idle.
> >
> > doug@s19:~$ sudo
> > /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary
> > --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt --interval 10
> > Busy%   Bzy_MHz IRQ     PkgWatt
> > 0.11    800     844     1.33
> > 0.01    800     231     1.33
> > 0.11    800     723     1.33 <<< Powersave governor
> > 0.03    889     440     1.33
> > 0.17    4418    21511   4.31 <<< Schedutil governor
> > 0.12    4101    30153   4.48 <<< 3 CPUs are > pstate 41
> > 0.22    4347    34226   4.75
> > 0.17    4101    43554   4.78
> > 0.29    4300    50565   4.94
> > 0.21    4098    50297   4.76 <<< 5 CPUs are > pstate 41
> > 0.29    4298    50532   4.84
> > 0.20    4101    50126   4.63
> > 0.20    4101    50149   4.62
> > 0.29    4297    50623   4.76
> > 0.20    4101    50203   4.72
> > 0.29    4295    50642   4.78
> > 0.20    4101    50223   4.68
> > 0.29    4292    50597   4.88
> > 0.20    4101    50208   4.73
> > 0.29    4296    50519   4.84
> > 0.20    4101    50167   4.80
> > 0.20    4101    50242   4.76
> > 0.29    4302    50625   4.94
> > 0.20    4101    50233   4.73
> > 0.29    4296    50613   4.78
> > 0.20    4101    50231   4.70
> > 0.29    4292    50802   4.93
> > 1.46    4669    65610   8.36
> > 0.41    4225    80701   5.48
> > 0.33    4101    80219   5.36 <<< 8 CPUs are > pstate 41
> > 0.34    4098    80313   5.38
> > 0.41    4228    80689   5.56
> > 0.33    4101    80252   5.46
> >
> > And the related MSR reads:
> >
> > 3 CPUs are > pstate 41:
> > root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> > 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  30 :   8 :   8 :  48 :
> > 48 :  48 :   8 :  30 :  31 :   8 :   8 :   8 :
> >
> > 5 CPUs are > pstate 41:
> > root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> > 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  44 :  30 :  31 :  48 :
> > 48 :  48 :   8 :   8 :   8 :   8 :  48 :   8 :
> >
> > 8 CPUs are > pstate 41:
> > root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> > 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  45 :  48 :  48 :  48 :
> > 48 :  48 :   8 :  30 :   8 :   8 :  48 :  42 :
> >
> > This issue is independent of the original patch or the suggested modification:

Actually, the issue threshold is defined by the greater-than condition below.

> >
> > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > > index f878a4545eee..94018ac0b59b 100644
> > > --- a/drivers/cpufreq/intel_pstate.c
> > > +++ b/drivers/cpufreq/intel_pstate.c
> > > @@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct
> > > cpudata *cpu)
> > >          * P-states to prevent them from getting back to the high frequency
> > >          * right away after getting out of deep idle.
> > >          */
> > > -       cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);

For the above kernel the threshold is pstate 42.

> > > +       cpuidle_update_retain_tick(pstate > ((cpu->pstate.max_pstate +
> > > cpu->pstate.min_pstate)/2));

For the above kernel the threshold is pstate 25.
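
As a cross-check, the arithmetic behind these two thresholds, using the
pstate limits of the i5-10600K quoted above (min 8, max non-turbo 41), is
sketched below; it only illustrates the two greater-than conditions and is
not kernel code.

#include <stdio.h>

int main(void)
{
        int min_pstate = 8, max_pstate = 41;    /* i5-10600K, non-turbo max */

        /* original condition: pstate > cpu->pstate.max_pstate */
        printf("tick retained from pstate %d\n", max_pstate + 1);      /* 42 */

        /* modified condition: pstate > (max_pstate + min_pstate) / 2 */
        printf("tick retained from pstate %d\n",
               (max_pstate + min_pstate) / 2 + 1);      /* (41 + 8) / 2 = 24 -> 25 */
        return 0;
}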

> > >         wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
> > >  }
> >
> > ... Doug

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-17 13:58                                                         ` Doug Smythies
@ 2022-03-24 14:04                                                           ` Doug Smythies
  2022-03-24 18:17                                                             ` Rafael J. Wysocki
  0 siblings, 1 reply; 47+ messages in thread
From: Doug Smythies @ 2022-03-24 14:04 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Zhang, Rui, Thomas Gleixner, paulmck, stable,
	x86, linux-pm, srinivas pandruvada, Feng Tang, dsmythies

Hi Rafael,

Do you have any suggestions for the proposed patch?
I have tried to figure out what is wrong but haven't been able to.

... Doug

On Thu, Mar 17, 2022 at 6:58 AM Doug Smythies <dsmythies@telus.net> wrote:
>
> On Thu, Mar 17, 2022 at 5:30 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Wed, Mar 16, 2022 at 4:55 PM Doug Smythies <dsmythies@telus.net> wrote:
> > >
> > > Readers: So that graphs and large attachments could be used, I have
> > > been on an off-list branch of this thread with Srinivas, and copied a
> > > couple of others. While now returning to this on-list thread, I'll
> > > only take up Rafael's proposed patch.
> > >
> > > Hi Rafael,
> > >
> > > So far all work has been done with: HWP disabled; intel_pstate; powersave.
> > > The reason was that it is, by far, the best way to obtain good trace data
> > > using the intel_pstate_tracer.py utility.
> > >
> > > I always intended to try/test: HWP disabled; intel_cpufreq; schedutil.
> > > There is an issue with the proposed patch and schedutil.
> > >
> > > If any CPU ever requests a pstate > the max non turbo pstate
> > > then it will stay at that request forever. Ultimately the idle
> > > power goes to about 5.7 watts (versus 1.4 watts expected).
> > > IRQs go very high, as the tick never turns off.
> > > Actually, one knows how many CPUs are stuck requesting a high
> > > pstate just by looking at IRQs.
> >
> > That may be because INTEL_CPUFREQ_TRANSITION_DELAY is too small.
> >
> > Please try to increase
> > /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us to 10000 and
> > see what difference this makes.
>
> Changing rate_limit_us to 10000, or even 20000, makes no difference.
>
> see a slight clarification to yesterday's email in-line below.
>
> > > Trace is useless because it virtually never gets called.
> > > So I have been reading the IA32_PERF_CTL MSR
> > > directly.
> > >
> > > Example:
> > >
> > > Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> > > 6 cores, 12 CPUs
> > > min pstate 8
> > > max non-turbo pstate 41
> > > max turbo pstate 48
> > > The system is idle.
> > >
> > > doug@s19:~$ sudo
> > > /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary
> > > --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt --interval 10
> > > Busy%   Bzy_MHz IRQ     PkgWatt
> > > 0.11    800     844     1.33
> > > 0.01    800     231     1.33
> > > 0.11    800     723     1.33 <<< Powersave governor
> > > 0.03    889     440     1.33
> > > 0.17    4418    21511   4.31 <<< Schedutil governor
> > > 0.12    4101    30153   4.48 <<< 3 CPUs are > pstate 41
> > > 0.22    4347    34226   4.75
> > > 0.17    4101    43554   4.78
> > > 0.29    4300    50565   4.94
> > > 0.21    4098    50297   4.76 <<< 5 CPUs are > pstate 41
> > > 0.29    4298    50532   4.84
> > > 0.20    4101    50126   4.63
> > > 0.20    4101    50149   4.62
> > > 0.29    4297    50623   4.76
> > > 0.20    4101    50203   4.72
> > > 0.29    4295    50642   4.78
> > > 0.20    4101    50223   4.68
> > > 0.29    4292    50597   4.88
> > > 0.20    4101    50208   4.73
> > > 0.29    4296    50519   4.84
> > > 0.20    4101    50167   4.80
> > > 0.20    4101    50242   4.76
> > > 0.29    4302    50625   4.94
> > > 0.20    4101    50233   4.73
> > > 0.29    4296    50613   4.78
> > > 0.20    4101    50231   4.70
> > > 0.29    4292    50802   4.93
> > > 1.46    4669    65610   8.36
> > > 0.41    4225    80701   5.48
> > > 0.33    4101    80219   5.36 <<< 8 CPUs are > pstate 41
> > > 0.34    4098    80313   5.38
> > > 0.41    4228    80689   5.56
> > > 0.33    4101    80252   5.46
> > >
> > > And the related MSR reads:
> > >
> > > 3 CPUs are > pstate 41:
> > > root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> > > 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  30 :   8 :   8 :  48 :
> > > 48 :  48 :   8 :  30 :  31 :   8 :   8 :   8 :
> > >
> > > 5 CPUs are > pstate 41:
> > > root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> > > 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  44 :  30 :  31 :  48 :
> > > 48 :  48 :   8 :   8 :   8 :   8 :  48 :   8 :
> > >
> > > 8 CPUs are > pstate 41:
> > > root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> > > 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  45 :  48 :  48 :  48 :
> > > 48 :  48 :   8 :  30 :   8 :   8 :  48 :  42 :
> > >
> > > This issue is independent of the original patch or the suggested modification:
>
> Actually, the issue threshold is as defined by the greater than condition below.
>
> > >
> > > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > > > index f878a4545eee..94018ac0b59b 100644
> > > > --- a/drivers/cpufreq/intel_pstate.c
> > > > +++ b/drivers/cpufreq/intel_pstate.c
> > > > @@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct
> > > > cpudata *cpu)
> > > >          * P-states to prevent them from getting back to the high frequency
> > > >          * right away after getting out of deep idle.
> > > >          */
> > > > -       cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
>
> For the above kernel the threshold is pstate 42.
>
> > > > +       cpuidle_update_retain_tick(pstate > ((cpu->pstate.max_pstate +
> > > > cpu->pstate.min_pstate)/2));
>
> For the above kernel the threshold is pstate 25.
>
> > > >         wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
> > > >  }
> > >
> > > ... Doug

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-24 14:04                                                           ` Doug Smythies
@ 2022-03-24 18:17                                                             ` Rafael J. Wysocki
  2022-03-25  0:03                                                               ` Doug Smythies
  0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2022-03-24 18:17 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Zhang, Rui,
	Thomas Gleixner, paulmck, stable, x86, linux-pm,
	srinivas pandruvada, Feng Tang

Hi Doug,

On Thu, Mar 24, 2022 at 3:04 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> Hi Rafael,
>
> Do you have any suggestions for the proposed patch?

Not really.

It looks like avoiding stopping the scheduler tick is sufficient to
bump up the PELT signal for this workload in such a way that it never
falls below a certain level, which in turn causes schedutil to ask for
higher frequencies.

An alternative approach appears to be necessary, but I need some more
time for that.
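
For readers following the thread, a rough sketch of that feedback is below.
It assumes schedutil's usual ~1.25x headroom, i.e. roughly
freq = 1.25 * max_freq * util / 1024; the exact kernel code differs, and the
numbers are only indicative for the i5-10600K discussed here.

#include <stdio.h>

int main(void)
{
        unsigned long long max_freq = 4800000;  /* kHz, pstate 48 */
        unsigned long long non_turbo = 4100000; /* kHz, pstate 41 */
        unsigned long long util;

        for (util = 0; util <= 1024; util += 64) {
                unsigned long long freq = (max_freq + (max_freq >> 2)) * util / 1024;

                printf("util %4llu -> ~%7llu kHz%s\n", util, freq,
                       freq > non_turbo ? "  (turbo range, tick retained)" : "");
        }
        /*
         * Any utilization above roughly 700/1024 maps past the non-turbo
         * maximum, so the tick is retained, which in turn keeps the
         * utilization from decaying: the request never unwinds on its own.
         */
        return 0;
}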

> I have tried to figure out what is wrong but haven't been able to.
>
> ... Doug
>
> On Thu, Mar 17, 2022 at 6:58 AM Doug Smythies <dsmythies@telus.net> wrote:
> >
> > On Thu, Mar 17, 2022 at 5:30 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > On Wed, Mar 16, 2022 at 4:55 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > >
> > > > Readers: So that graphs and large attachments could be used, I have
> > > > been on an off-list branch of this thread with Srinivas, and copied a
> > > > couple of others. While now returning to this on-list thread, I'll
> > > > only take up Rafael's proposed patch.
> > > >
> > > > Hi Rafael,
> > > >
> > > > So far all work has been done with: HWP disabled; intel_pstate; powersave.
> > > > The reason was that it is, by far, the best way to obtain good trace data
> > > > using the intel_pstate_tracer.py utility.
> > > >
> > > > I always intended to try/test: HWP disabled; intel_cpufreq; schedutil.
> > > > There is an issue with the proposed patch and schedutil.
> > > >
> > > > If any CPU ever requests a pstate > the max non turbo pstate
> > > > then it will stay at that request forever. Ultimately the idle
> > > > power goes to about 5.7 watts (versus 1.4 watts expected).
> > > > IRQs go very high, as the tick never turns off.
> > > > Actually, one knows how many CPUs are stuck requesting a high
> > > > pstate just by looking at IRQs.
> > >
> > > That may be because INTEL_CPUFREQ_TRANSITION_DELAY is too small.
> > >
> > > Please try to increase
> > > /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us to 10000 and
> > > see what difference this makes.
> >
> > Changing rate_limit_us to 10000, or even 20000, makes no difference.
> >
> > see a slight clarification to yesterday's email in-line below.
> >
> > > > Trace is useless because it virtually never gets called.
> > > > So I have been reading the IA32_PERF_CTL MSR
> > > > directly.
> > > >
> > > > Example:
> > > >
> > > > Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> > > > 6 cores, 12 CPUs
> > > > min pstate 8
> > > > max non-turbo pstate 41
> > > > max turbo pstate 48
> > > > The system is idle.
> > > >
> > > > doug@s19:~$ sudo
> > > > /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary
> > > > --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt --interval 10
> > > > Busy%   Bzy_MHz IRQ     PkgWatt
> > > > 0.11    800     844     1.33
> > > > 0.01    800     231     1.33
> > > > 0.11    800     723     1.33 <<< Powersave governor
> > > > 0.03    889     440     1.33
> > > > 0.17    4418    21511   4.31 <<< Schedutil governor
> > > > 0.12    4101    30153   4.48 <<< 3 CPUs are > pstate 41
> > > > 0.22    4347    34226   4.75
> > > > 0.17    4101    43554   4.78
> > > > 0.29    4300    50565   4.94
> > > > 0.21    4098    50297   4.76 <<< 5 CPUs are > pstate 41
> > > > 0.29    4298    50532   4.84
> > > > 0.20    4101    50126   4.63
> > > > 0.20    4101    50149   4.62
> > > > 0.29    4297    50623   4.76
> > > > 0.20    4101    50203   4.72
> > > > 0.29    4295    50642   4.78
> > > > 0.20    4101    50223   4.68
> > > > 0.29    4292    50597   4.88
> > > > 0.20    4101    50208   4.73
> > > > 0.29    4296    50519   4.84
> > > > 0.20    4101    50167   4.80
> > > > 0.20    4101    50242   4.76
> > > > 0.29    4302    50625   4.94
> > > > 0.20    4101    50233   4.73
> > > > 0.29    4296    50613   4.78
> > > > 0.20    4101    50231   4.70
> > > > 0.29    4292    50802   4.93
> > > > 1.46    4669    65610   8.36
> > > > 0.41    4225    80701   5.48
> > > > 0.33    4101    80219   5.36 <<< 8 CPUs are > pstate 41
> > > > 0.34    4098    80313   5.38
> > > > 0.41    4228    80689   5.56
> > > > 0.33    4101    80252   5.46
> > > >
> > > > And the related MSR reads:
> > > >
> > > > 3 CPUs are > pstate 41:
> > > > root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> > > > 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  30 :   8 :   8 :  48 :
> > > > 48 :  48 :   8 :  30 :  31 :   8 :   8 :   8 :
> > > >
> > > > 5 CPUs are > pstate 41:
> > > > root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> > > > 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  44 :  30 :  31 :  48 :
> > > > 48 :  48 :   8 :   8 :   8 :   8 :  48 :   8 :
> > > >
> > > > 8 CPUs are > pstate 41:
> > > > root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> > > > 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  45 :  48 :  48 :  48 :
> > > > 48 :  48 :   8 :  30 :   8 :   8 :  48 :  42 :
> > > >
> > > > This issue is independent of the original patch or the suggested modification:
> >
> > Actually, the issue threshold is as defined by the greater than condition below.
> >
> > > >
> > > > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > > > > index f878a4545eee..94018ac0b59b 100644
> > > > > --- a/drivers/cpufreq/intel_pstate.c
> > > > > +++ b/drivers/cpufreq/intel_pstate.c
> > > > > @@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct
> > > > > cpudata *cpu)
> > > > >          * P-states to prevent them from getting back to the high frequency
> > > > >          * right away after getting out of deep idle.
> > > > >          */
> > > > > -       cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
> >
> > For the above kernel the threshold is pstate 42.
> >
> > > > > +       cpuidle_update_retain_tick(pstate > ((cpu->pstate.max_pstate +
> > > > > cpu->pstate.min_pstate)/2));
> >
> > For the above kernel the threshold is pstate 25.
> >
> > > > >         wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
> > > > >  }
> > > >
> > > > ... Doug

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: CPU excessively long times between frequency scaling driver calls - bisected
  2022-03-24 18:17                                                             ` Rafael J. Wysocki
@ 2022-03-25  0:03                                                               ` Doug Smythies
  0 siblings, 0 replies; 47+ messages in thread
From: Doug Smythies @ 2022-03-25  0:03 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Zhang, Rui, Thomas Gleixner, paulmck, stable,
	x86, linux-pm, srinivas pandruvada, Feng Tang, dsmythies

On Thu, Mar 24, 2022 at 11:17 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> Hi Doug,
>
> On Thu, Mar 24, 2022 at 3:04 PM Doug Smythies <dsmythies@telus.net> wrote:
> >
> > Hi Rafael,
> >
> > Do you have any suggestions for the proposed patch?
>
> Not really.
>
> It looks like the avoidance to stop the scheduler tick is sufficient
> to bump up the PELT signal for this workload in such a way that it
> doesn't fall below a certain level at all which in turn causes
> schedutil to ask for higher frequencies.
>
> An alternative approach appears to be necessary, but I need some more
> time for that.

Hi Rafael,

O.K. thanks for the reply.
This can be sidelined for now if you prefer.
As mentioned in one of the off-list emails:

I was always aware that we might be heading towards a solution
tailored to my specific test workflow. It is one relatively simple
thing to create an example workflow that exploits the issue, but quite
another to claim that the proposed solution works for any workflow and
hardware.

>
> > I have tried to figure out what is wrong but haven't been able to.
> >
> > ... Doug
> >
> > On Thu, Mar 17, 2022 at 6:58 AM Doug Smythies <dsmythies@telus.net> wrote:
> > >
> > > On Thu, Mar 17, 2022 at 5:30 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > On Wed, Mar 16, 2022 at 4:55 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > > >
> > > > > Readers: So that graphs and large attachments could be used, I have
> > > > > been on an off-list branch of this thread with Srinivas, and copied a
> > > > > couple of others. While now returning to this on-list thread, I'll
> > > > > only take up Rafael's proposed patch.
> > > > >
> > > > > Hi Rafael,
> > > > >
> > > > > So far all work has been done with: HWP disabled; intel_pstate; powersave.
> > > > > The reason was that it is, by far, the best way to obtain good trace data
> > > > > using the intel_pstate_tracer.py utility.
> > > > >
> > > > > I always intended to try/test: HWP disabled; intel_cpufreq; schedutil.
> > > > > There is an issue with the proposed patch and schedutil.
> > > > >
> > > > > If any CPU ever requests a pstate > the max non turbo pstate
> > > > > then it will stay at that request forever. Ultimately the idle
> > > > > power goes to about 5.7 watts (verses 1.4 watts expected).
> > > > > IRQs go very high, as the tick never turns off.
> > > > > Actually, one knows how many CPUs are stuck requesting a high
> > > > > pstate just by looking at IRQs.
> > > >
> > > > That may be because INTEL_CPUFREQ_TRANSITION_DELAY is too small.
> > > >
> > > > Please try to increase
> > > > /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us to 10000 and
> > > > see what difference this makes.
> > >
> > > Changing rate_limit_us to 10000, or even 20000, makes no difference.
> > >
> > > see a slight clarification to yesterday's email in-line below.
> > >
> > > > > Trace is useless because it virtually never gets called.
> > > > > So I have been reading the IA32_PERF_CTL MSR
> > > > > directly.
> > > > >
> > > > > Example:
> > > > >
> > > > > Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> > > > > 6 cores, 12 CPUs
> > > > > min pstate 8
> > > > > max non-turbo pstate 41
> > > > > max turbo pstate 48
> > > > > The system is idle.
> > > > >
> > > > > doug@s19:~$ sudo
> > > > > /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary
> > > > > --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt --interval 10
> > > > > Busy%   Bzy_MHz IRQ     PkgWatt
> > > > > 0.11    800     844     1.33
> > > > > 0.01    800     231     1.33
> > > > > 0.11    800     723     1.33 <<< Powersave governor
> > > > > 0.03    889     440     1.33
> > > > > 0.17    4418    21511   4.31 <<< Schedutil governor
> > > > > 0.12    4101    30153   4.48 <<< 3 CPUs are > pstate 41
> > > > > 0.22    4347    34226   4.75
> > > > > 0.17    4101    43554   4.78
> > > > > 0.29    4300    50565   4.94
> > > > > 0.21    4098    50297   4.76 <<< 5 CPUs are > pstate 41
> > > > > 0.29    4298    50532   4.84
> > > > > 0.20    4101    50126   4.63
> > > > > 0.20    4101    50149   4.62
> > > > > 0.29    4297    50623   4.76
> > > > > 0.20    4101    50203   4.72
> > > > > 0.29    4295    50642   4.78
> > > > > 0.20    4101    50223   4.68
> > > > > 0.29    4292    50597   4.88
> > > > > 0.20    4101    50208   4.73
> > > > > 0.29    4296    50519   4.84
> > > > > 0.20    4101    50167   4.80
> > > > > 0.20    4101    50242   4.76
> > > > > 0.29    4302    50625   4.94
> > > > > 0.20    4101    50233   4.73
> > > > > 0.29    4296    50613   4.78
> > > > > 0.20    4101    50231   4.70
> > > > > 0.29    4292    50802   4.93
> > > > > 1.46    4669    65610   8.36
> > > > > 0.41    4225    80701   5.48
> > > > > 0.33    4101    80219   5.36 <<< 8 CPUs are > pstate 41
> > > > > 0.34    4098    80313   5.38
> > > > > 0.41    4228    80689   5.56
> > > > > 0.33    4101    80252   5.46
> > > > >
> > > > > And the related MSR reads:
> > > > >
> > > > > 3 CPUs are > pstate 41:
> > > > > root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> > > > > 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  30 :   8 :   8 :  48 :
> > > > > 48 :  48 :   8 :  30 :  31 :   8 :   8 :   8 :
> > > > >
> > > > > 5 CPUs are > pstate 41:
> > > > > root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> > > > > 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  44 :  30 :  31 :  48 :
> > > > > 48 :  48 :   8 :   8 :   8 :   8 :  48 :   8 :
> > > > >
> > > > > 8 CPUs are > pstate 41:
> > > > > root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
> > > > > 9.) 0x199: IA32_PERF_CTL        : CPU 0-11 :  45 :  48 :  48 :  48 :
> > > > > 48 :  48 :   8 :  30 :   8 :   8 :  48 :  42 :
> > > > >
> > > > > This issue is independent of the original patch or the suggested modification:
> > >
> > > Actually, the issue threshold is as defined by the greater than condition below.
> > >
> > > > >
> > > > > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > > > > > index f878a4545eee..94018ac0b59b 100644
> > > > > > --- a/drivers/cpufreq/intel_pstate.c
> > > > > > +++ b/drivers/cpufreq/intel_pstate.c
> > > > > > @@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct
> > > > > > cpudata *cpu)
> > > > > >          * P-states to prevent them from getting back to the high frequency
> > > > > >          * right away after getting out of deep idle.
> > > > > >          */
> > > > > > -       cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
> > >
> > > For the above kernel the threshold is pstate 42.
> > >
> > > > > > +       cpuidle_update_retain_tick(pstate > ((cpu->pstate.max_pstate +
> > > > > > cpu->pstate.min_pstate)/2));
> > >
> > > For the above kernel the threshold is pstate 25.
> > >
> > > > > >         wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
> > > > > >  }
> > > > >
> > > > > ... Doug

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2022-03-25  0:03 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-08  1:40 CPU excessively long times between frequency scaling driver calls - bisected Doug Smythies
2022-02-08  2:39 ` Feng Tang
2022-02-08  7:13   ` Doug Smythies
2022-02-08  9:15     ` Feng Tang
2022-02-09  6:23       ` Doug Smythies
2022-02-10  7:45         ` Zhang, Rui
2022-02-13 18:54           ` Doug Smythies
2022-02-14 15:17             ` srinivas pandruvada
2022-02-15 21:35               ` Doug Smythies
2022-02-22  7:34               ` Feng Tang
2022-02-22 18:04                 ` Rafael J. Wysocki
2022-02-23  0:07                   ` Doug Smythies
2022-02-23  0:32                     ` srinivas pandruvada
2022-02-23  0:40                       ` Feng Tang
2022-02-23 14:23                         ` Rafael J. Wysocki
2022-02-24  8:08                           ` Feng Tang
2022-02-24 14:44                             ` Paul E. McKenney
2022-02-24 16:29                               ` Doug Smythies
2022-02-24 16:58                                 ` Paul E. McKenney
2022-02-25  0:29                               ` Feng Tang
2022-02-25  1:06                                 ` Paul E. McKenney
2022-02-25 17:45                             ` Rafael J. Wysocki
2022-02-26  0:36                               ` Doug Smythies
2022-02-28  4:12                                 ` Feng Tang
2022-02-28 19:36                                   ` Rafael J. Wysocki
2022-03-01  5:52                                     ` Feng Tang
2022-03-01 11:58                                       ` Rafael J. Wysocki
2022-03-01 17:18                                         ` Doug Smythies
2022-03-01 17:34                                           ` Rafael J. Wysocki
2022-03-02  4:06                                             ` Doug Smythies
2022-03-02 19:00                                               ` Rafael J. Wysocki
2022-03-03 23:00                                                 ` Doug Smythies
2022-03-04  6:59                                                   ` Doug Smythies
2022-03-16 15:54                                                     ` Doug Smythies
2022-03-17 12:30                                                       ` Rafael J. Wysocki
2022-03-17 13:58                                                         ` Doug Smythies
2022-03-24 14:04                                                           ` Doug Smythies
2022-03-24 18:17                                                             ` Rafael J. Wysocki
2022-03-25  0:03                                                               ` Doug Smythies
2022-03-03  5:27                                               ` Feng Tang
2022-03-03 12:02                                                 ` Rafael J. Wysocki
2022-03-04  5:13                                                   ` Feng Tang
2022-03-04 16:23                                                     ` Paul E. McKenney
2022-02-23  2:49                   ` Feng Tang
2022-02-23 14:11                     ` Rafael J. Wysocki
2022-02-23  9:40                   ` Thomas Gleixner
2022-02-23 14:23                     ` Rafael J. Wysocki
