* SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
@ 2016-03-01 17:51 Len Brown
       [not found] ` <87si087tsr.fsf@iki.fi>
  0 siblings, 1 reply; 59+ messages in thread
From: Len Brown @ 2016-03-01 17:51 UTC (permalink / raw)
  To: Arto Jantunen
  Cc: Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, Doug Smythies,
	Rafael J. Wysocki, linux-pm, Rik van Riel

> Since this is about cpuidle, I'll also mention that this hardware
> requires idle=nomwait on the command line, otherwise the kernel will not
> boot.

Arto,

A boot failure is even more disruptive than bad P-state and C-state choices!
Please tell us more about it.

What c-states are used when you use "idle=nomwait"?
You can see them this way:

grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*

Please boot with intel_idle.max_cstate=0

If that also boots, please again show the available c-states via the grep above.

Does it happen if you use default cmdline,
but edit intel_idle.c to comment out c8 and c9 support?

Please run
# turbostat --debug 2>&1 | tee ts.out
and send ts.out to show the deepest idle state that this system can
actually enter.

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
       [not found] ` <87si087tsr.fsf@iki.fi>
@ 2016-03-02 17:10   ` Rik van Riel
  2016-03-08 21:13     ` Len Brown
  2016-03-08 21:19   ` Len Brown
  1 sibling, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-02 17:10 UTC (permalink / raw)
  To: Arto Jantunen, Len Brown
  Cc: Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, Doug Smythies,
	Rafael J. Wysocki, linux-pm

[-- Attachment #1: Type: text/plain, Size: 5784 bytes --]

On Wed, 2016-03-02 at 18:01 +0200, Arto Jantunen wrote:
> Len Brown <lenb@kernel.org> writes:
> 
> > > Since this is about cpuidle, I'll also mention that this hardware
> > > requires idle=nomwait on the command line, otherwise the kernel
> > > will not
> > > boot.
> > 
> > Arto,
> > 
> > A boot failure is even more disruptive than bad P-state and C-state 
> > choices!
> > Please tell us more about it.
> > 
> > What c-states are used when you use "idle=nomwait"?
> > you can see them this way:
> > 
> > grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
> 
> /sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL
> IDLE
> /sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> /sys/devices/system/cpu/cpu0/cpuidle/state0/power:4294967295
> /sys/devices/system/cpu/cpu0/cpuidle/state0/residency:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/time:1405175
> /sys/devices/system/cpu/cpu0/cpuidle/state0/usage:135

It looks like the system spent some time in idle
poll, but not that much.

> /sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI HLT
> /sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:1
> /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> /sys/devices/system/cpu/cpu0/cpuidle/state1/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/residency:2
> /sys/devices/system/cpu/cpu0/cpuidle/state1/time:69898878
> /sys/devices/system/cpu/cpu0/cpuidle/state1/usage:25539

This looks reasonable, with an exit latency of 1 us,
and a lot of idle time spent here.

> /sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI IOPORT 0x1816
> /sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:151
> /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> /sys/devices/system/cpu/cpu0/cpuidle/state2/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/residency:302
> /sys/devices/system/cpu/cpu0/cpuidle/state2/time:28132
> /sys/devices/system/cpu/cpu0/cpuidle/state2/usage:10724

The exit latencies for your C2 and C3 states look excessive:
151 and 1034 microseconds, respectively.

I believe there are supposed to be some lower latency
idle states in there. The skl_cstates table in
drivers/idle/intel_idle.c suggests you should be seeing
states with exit latencies of 2, 10, 70, 85, and 124
us before getting to that C2 state you see above.

As you can see from the "time" stats, your CPU spends
most of its time in HLT or polling, and very little time
in these deeper, much much slower idle states.

Having the mwait based idle states working would probably
get the CPU to spend a lot less time in HLT, and more time
in "proper" idle states.
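Rik's reading of the "time" stats can be checked mechanically. Below is a small sketch (the helper name is mine, not from the thread) that takes lines in the format of the `grep` output quoted above and prints each state's share of the summed idle time:

```shell
#!/bin/sh
# Compute each state's share of total idle time from grep output lines
# shaped like .../cpuidle/stateN/time:USEC (as quoted in this thread).
idle_share() {
    awk -F'[/:]' '/\/time:/ {
        t[$(NF-2)] = $NF          # $(NF-2) is "stateN", $NF is the time
        total += $NF
    } END {
        for (s in t)
            printf "%s %.1f%%\n", s, t[s] / total * 100
    }' | sort
}
```

Feeding it the four "time" values from the idle=nomwait boot above gives roughly 2% in POLL and 98% in C1 (HLT), which matches Rik's observation that the deep ACPI IOPORT states are barely used.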

> /sys/devices/system/cpu/cpu0/cpuidle/state3/desc:ACPI IOPORT 0x1819
> /sys/devices/system/cpu/cpu0/cpuidle/state3/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:1034
> /sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3
> /sys/devices/system/cpu/cpu0/cpuidle/state3/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/residency:2068
> /sys/devices/system/cpu/cpu0/cpuidle/state3/time:14169
> /sys/devices/system/cpu/cpu0/cpuidle/state3/usage:3375



> > Please boot with intel_idle.max_cstate=0
> > 
> > If that also boots, please again show the available c-states via
> > the grep above.
> 
> It does, here:
> 
> /sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL
> IDLE
> /sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> /sys/devices/system/cpu/cpu0/cpuidle/state0/power:4294967295
> /sys/devices/system/cpu/cpu0/cpuidle/state0/residency:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/time:30951
> /sys/devices/system/cpu/cpu0/cpuidle/state0/usage:132
> /sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI FFH INTEL MWAIT
> 0x0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:1
> /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> /sys/devices/system/cpu/cpu0/cpuidle/state1/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/residency:2
> /sys/devices/system/cpu/cpu0/cpuidle/state1/time:4231850
> /sys/devices/system/cpu/cpu0/cpuidle/state1/usage:14709
> /sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI FFH INTEL MWAIT
> 0x33
> /sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:151
> /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> /sys/devices/system/cpu/cpu0/cpuidle/state2/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/residency:302
> /sys/devices/system/cpu/cpu0/cpuidle/state2/time:6524492
> /sys/devices/system/cpu/cpu0/cpuidle/state2/usage:4621
> /sys/devices/system/cpu/cpu0/cpuidle/state3/desc:ACPI FFH INTEL MWAIT
> 0x60
> /sys/devices/system/cpu/cpu0/cpuidle/state3/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:1034
> /sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3
> /sys/devices/system/cpu/cpu0/cpuidle/state3/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/residency:2068
> /sys/devices/system/cpu/cpu0/cpuidle/state3/time:52217106
> /sys/devices/system/cpu/cpu0/cpuidle/state3/usage:3364
> 
> > Does it happen if you use default cmdline,
> > but edit intel_idle.c to comment out c8 and c9 support?
> 
> That also boots, ts.out attached (as is the patch to do this, to make
> sure there is no miscommunication). With this change the original bug
> disappears as well.
> 
-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]


* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-02 17:10   ` Rik van Riel
@ 2016-03-08 21:13     ` Len Brown
  0 siblings, 0 replies; 59+ messages in thread
From: Len Brown @ 2016-03-08 21:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Arto Jantunen, Viresh Kumar, Srinivas Pandruvada, Chen, Yu C,
	Doug Smythies, Rafael J. Wysocki, linux-pm

> Having the mwait based idle states working would probably
> get the CPU to spend a lot less time in HLT, and more time
> in "proper" idle states.

Agreed, the ACPI IOPORT latencies are sort of nonsense on this box,
and "intel_idle.max_cstate" is a better workaround than
"idle=nomwait", because it at least allows ACPI to use mwait.

But the difference is sort of minor compared to the other issues at play.

>> /sys/devices/system/cpu/cpu0/cpuidle/state3/desc:ACPI IOPORT 0x1819
>> /sys/devices/system/cpu/cpu0/cpuidle/state3/disable:0
>> /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:1034
>> /sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3
>> /sys/devices/system/cpu/cpu0/cpuidle/state3/power:0
>> /sys/devices/system/cpu/cpu0/cpuidle/state3/residency:2068
>> /sys/devices/system/cpu/cpu0/cpuidle/state3/time:14169
>> /sys/devices/system/cpu/cpu0/cpuidle/state3/usage:3375

Len Brown, Intel Open Source Technology Center


* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
       [not found] ` <87si087tsr.fsf@iki.fi>
  2016-03-02 17:10   ` Rik van Riel
@ 2016-03-08 21:19   ` Len Brown
  2016-03-09 17:01     ` Arto Jantunen
  1 sibling, 1 reply; 59+ messages in thread
From: Len Brown @ 2016-03-08 21:19 UTC (permalink / raw)
  To: Arto Jantunen
  Cc: Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, Doug Smythies,
	Rafael J. Wysocki, linux-pm, Rik van Riel

On Wed, Mar 2, 2016 at 11:01 AM, Arto Jantunen <viiru@iki.fi> wrote:
> Len Brown <lenb@kernel.org> writes:
>
>>> Since this is about cpuidle, I'll also mention that this hardware
>>> requires idle=nomwait on the command line, otherwise the kernel will not
>>> boot.
>>
>> Arto,
>>
>> A boot failure is even more disruptive than bad P-state and C-state choices!
>> Please tell us more about it.
>>
>> What c-states are used when you use "idle=nomwait"?
>> you can see them this way:
>>
>> grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
>
> /sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
> /sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> /sys/devices/system/cpu/cpu0/cpuidle/state0/power:4294967295
> /sys/devices/system/cpu/cpu0/cpuidle/state0/residency:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/time:1405175
> /sys/devices/system/cpu/cpu0/cpuidle/state0/usage:135
> /sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI HLT
> /sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:1
> /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> /sys/devices/system/cpu/cpu0/cpuidle/state1/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/residency:2
> /sys/devices/system/cpu/cpu0/cpuidle/state1/time:69898878
> /sys/devices/system/cpu/cpu0/cpuidle/state1/usage:25539
> /sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI IOPORT 0x1816
> /sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:151
> /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> /sys/devices/system/cpu/cpu0/cpuidle/state2/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/residency:302
> /sys/devices/system/cpu/cpu0/cpuidle/state2/time:28132
> /sys/devices/system/cpu/cpu0/cpuidle/state2/usage:10724
> /sys/devices/system/cpu/cpu0/cpuidle/state3/desc:ACPI IOPORT 0x1819
> /sys/devices/system/cpu/cpu0/cpuidle/state3/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:1034
> /sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3
> /sys/devices/system/cpu/cpu0/cpuidle/state3/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/residency:2068
> /sys/devices/system/cpu/cpu0/cpuidle/state3/time:14169
> /sys/devices/system/cpu/cpu0/cpuidle/state3/usage:3375
>
>> Please boot with intel_idle.max_cstate=0
>>
>> If that also boots, please again show the available c-states via the grep above.
>
> It does, here:
>
> /sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
> /sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> /sys/devices/system/cpu/cpu0/cpuidle/state0/power:4294967295
> /sys/devices/system/cpu/cpu0/cpuidle/state0/residency:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/time:30951
> /sys/devices/system/cpu/cpu0/cpuidle/state0/usage:132
> /sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:1
> /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> /sys/devices/system/cpu/cpu0/cpuidle/state1/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/residency:2
> /sys/devices/system/cpu/cpu0/cpuidle/state1/time:4231850
> /sys/devices/system/cpu/cpu0/cpuidle/state1/usage:14709
> /sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x33
> /sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:151
> /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> /sys/devices/system/cpu/cpu0/cpuidle/state2/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/residency:302
> /sys/devices/system/cpu/cpu0/cpuidle/state2/time:6524492
> /sys/devices/system/cpu/cpu0/cpuidle/state2/usage:4621
> /sys/devices/system/cpu/cpu0/cpuidle/state3/desc:ACPI FFH INTEL MWAIT 0x60
> /sys/devices/system/cpu/cpu0/cpuidle/state3/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:1034
> /sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3
> /sys/devices/system/cpu/cpu0/cpuidle/state3/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/residency:2068
> /sys/devices/system/cpu/cpu0/cpuidle/state3/time:52217106
> /sys/devices/system/cpu/cpu0/cpuidle/state3/usage:3364
>
>> Does it happen if you use default cmdline,
>> but edit intel_idle.c to comment out c8 and c9 support?
>
> That also boots, ts.out attached (as is the patch to do this, to make
> sure there is no miscommunication).

yes, patch looks good, and thanks for running turbostat.

If your system boots and runs properly with "intel_idle.max_cstate=7",
then you are suffering this issue:

https://bugzilla.kernel.org/show_bug.cgi?id=109081

> With this change the original bug disappears as well.

Hmm, so both your problem with the frequency staying constant
and the boot issue went away after deleting c8 and c9?

Still works with intel_idle.max_cstate=7 ?

thanks,
Len Brown, Intel Open Source Technology Center


* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-08 21:19   ` Len Brown
@ 2016-03-09 17:01     ` Arto Jantunen
  2016-03-09 23:03       ` Doug Smythies
  0 siblings, 1 reply; 59+ messages in thread
From: Arto Jantunen @ 2016-03-09 17:01 UTC (permalink / raw)
  To: Len Brown
  Cc: Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, Doug Smythies,
	Rafael J. Wysocki, linux-pm, Rik van Riel

Len Brown <lenb@kernel.org> writes:

> On Wed, Mar 2, 2016 at 11:01 AM, Arto Jantunen <viiru@iki.fi> wrote:
>> Len Brown <lenb@kernel.org> writes:
>>> Does it happen if you use default cmdline,
>>> but edit intel_idle.c to comment out c8 and c9 support?
>>
>> That also boots, ts.out attached (as is the patch to do this, to make
>> sure there is no miscommunication).
>
> yes, patch looks good, and thanks for running turbostat.
>
> If your system boots and runs properly with "intel_idle.max_cstate=7",
> then you are suffering this issue:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=109081

That's the one, indeed. Based on the report this is actually a microcode
bug?

This is the version info that I see at boot:

microcode: CPU0 sig=0x506e3, pf=0x20, revision=0x33

The machine is an Asus GL552VW (reported as affected in the bug report),
with BIOS version 216, which is the latest.

>> With this change the original bug disappears as well.
>
> Hmm, so your problems with frequency staying constant,
> as well as the boot issue both went away by deleting c8 and c9?

Yes. I assume that removing idle=nomwait is the fix for the frequency
issue, while removing the deepest idle states fixes the boot issue.

> Still works with intel_idle.max_cstate=7 ?

I'll test this.

-- 
Arto Jantunen


* RE: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-09 17:01     ` Arto Jantunen
@ 2016-03-09 23:03       ` Doug Smythies
  2016-03-09 23:18         ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Doug Smythies @ 2016-03-09 23:03 UTC (permalink / raw)
  To: 'Rik van Riel', 'Rafael J. Wysocki'
  Cc: 'Viresh Kumar', 'Srinivas Pandruvada',
	'Chen, Yu C', linux-pm, 'Arto Jantunen',
	'Len Brown'

Does the direction that this thread has taken mean that those two
commits that Arto identified might now not be reverted?

Commits:
9c4b2867ed7c8c8784dd417ffd16e705e81eb145
a9ceb78bc75ca47972096372ff3d48648b16317a

Recall that increased energy consumption was also isolated to those
two commits for some types of workflow. The suggestion is that the
commits should be reverted anyhow.

What seems to be happening is that CPUs are deciding to stay in idle
state 0 a lot more, when they actually could / should be in a deeper
idle state.

Test time is always 33 minutes and 20 seconds (8 CPUs), and there
is never any noticeable difference in execution times for the work:

Example 1: kernel 4.5-rc5 with the rjw 3 patch set version 10
"Replace timers with utilization update callbacks":
Aggregate times in each idle state:

rjwv10 (minutes)	reverted (minutes)	Idle State
27.0654211		2.617325533			0
12.92451315		24.53266672			1
3.668558467		3.780482633			2
1.2727832		1.61352195			3
130.8342596		141.2234947			4
		
175.7655355		173.7674915			total

Example 2: kernel 4.5-rc7 stock:
Aggregate times in each idle state:

k45rc7 (minutes)	reverted (minutes)	Idle State
20.1771917		2.638311483			0
13.02770225		21.81474838			1
3.428136783		3.951405			2
1.4540243		1.552488167			3
134.9057413		143.5533			4
		
172.9927963		173.5102531			total

Energy (restated from a previous e-mail):

Test 7: reverted: Package Joules: 47830
Test 8: rjwv10: Package Joules: 54419 (revert saves 12.1% energy)

Test 9: reverted: Package Joules: 49326
Test 10: rjwv10: Package Joules: 55442 (revert saves 11% energy)

Test 11: reverted: acpi-cpufreq ondemand: Package Joules: 49146
Test 12: rjwv10: acpi-cpufreq ondemand: Package Joules: 56302 (revert saves 12.7% energy)

Energy (not in any previous e-mail):

Reverted: 56178 Joules
Kernel 4.5-rc7: 63269 Joules (revert saves 12.6% energy)

Note 1: The test computer was rebuilt in between old and new tests,
so the baseline energy may have changed.

Note 2: Unless otherwise stated, intel_pstate driver, powersave governor.
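For reference, the savings percentages quoted above follow from the package Joules figures with the patched kernel's consumption as the baseline. A quick sketch (the helper name is mine):

```shell
# Percent energy saved by the revert: (patched - reverted) / patched.
pct_saved() {
    awk -v patched="$1" -v reverted="$2" \
        'BEGIN { printf "%.1f\n", (patched - reverted) / patched * 100 }'
}

pct_saved 54419 47830   # Test 8 vs Test 7 -> 12.1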
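For reference, the savings percentages quoted above follow from the package Joules figures with the patched kernel's consumption as the baseline. A quick sketch (the helper name is mine):

```shell
# Percent energy saved by the revert: (patched - reverted) / patched.
pct_saved() {
    awk -v patched="$1" -v reverted="$2" \
        'BEGIN { printf "%.1f\n", (patched - reverted) / patched * 100 }'
}

pct_saved 54419 47830   # Test 8 vs Test 7 -> prints 12.1
```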

... Doug




* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-09 23:03       ` Doug Smythies
@ 2016-03-09 23:18         ` Rafael J. Wysocki
  2016-03-09 23:45           ` Doug Smythies
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-09 23:18 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rik van Riel, Rafael J. Wysocki, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

On Thu, Mar 10, 2016 at 12:03 AM, Doug Smythies <dsmythies@telus.net> wrote:
> Does the direction that this thread has taken mean that those two
> commits that Arto identified might now not be reverted?

Yes.

> Commits:
> 9c4b2867ed7c8c8784dd417ffd16e705e81eb145
> a9ceb78bc75ca47972096372ff3d48648b16317a
>
> Recall that increased energy consumption was also isolated to those
> two commits for some types of workflow. The suggestion is that the
> commits should be reverted anyhow.
>
> What seems to be happening is that CPUs are deciding to stay in idle
> state 0 a lot more, when they actually could / should be in a deeper
> idle state.
>
> Test time is always: 33 minutes and 20 seconds test (8 CPUs), and there
> is never any noticeable difference in execution times for the work:
>
> Example 1: kernel 4.5-rc5 with the rjw 3 patch set version 10
> "Replace timers with utilization update callbacks":
> Aggregate times in each idle state:

You need to say what workload that is too.

> rjwv10 (minutes)        reverted (minutes)      Idle State
> 27.0654211              2.617325533                     0
> 12.92451315             24.53266672                     1
> 3.668558467             3.780482633                     2
> 1.2727832               1.61352195                      3
> 130.8342596             141.2234947                     4
>
> 175.7655355             173.7674915                     total
>
> Example 2: kernel 4.5-rc7 stock:
> Aggregate times in each idle state:
>
> k45rc7 (minutes)        reverted (minutes)      Idle State
> 20.1771917              2.638311483                     0
> 13.02770225             21.81474838                     1
> 3.428136783             3.951405                        2
> 1.4540243               1.552488167                     3
> 134.9057413             143.5533                        4
>
> 172.9927963             173.5102531                     total
>
> Energy (restated from a previous e-mail):
>
> Test 7: reverted: Package Joules: 47830
> Test 8: rjwv10: Package Joules: 54419 (revert saves 12.1% energy)
>
> Test 9: reverted: Package Joules: 49326
> Test 10: rjwv10: Package Joules: 55442 (revert saves 11% energy)
>
> Test 11: reverted: acpi-cpufreq ondemand: Package Joules: 49146
> Test 12: rjwv10: acpi-cpufreq ondemand: Package Joules: 56302 (revert saves 12.7% energy)
>
> Energy (not in any previous e-mail):
>
> Reverted: 56178 Joules
> Kernel 4.5-rc7: 63269 Joules (revert saves 12.6% energy)

So the claim in commit a9ceb78bc75ca47972096372ff3d48648b16317a is
that the change should not really affect systems with low C1
latencies.

I wonder what's the C1 latency on your test system?

Thanks,
Rafael


* RE: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-09 23:18         ` Rafael J. Wysocki
@ 2016-03-09 23:45           ` Doug Smythies
  2016-03-09 23:59             ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Doug Smythies @ 2016-03-09 23:45 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Rik van Riel', 'Rafael J. Wysocki',
	'Viresh Kumar', 'Srinivas Pandruvada',
	'Chen, Yu C', linux-pm, 'Arto Jantunen',
	'Len Brown'

On 2016.03.09 15:18 Rafael J. Wysocki wrote:
> On Thu, Mar 10, 2016 at 12:03 AM, Doug Smythies <dsmythies@telus.net> wrote:
>> Does the direction that this thread has taken mean that those two
>> commits that Arto identified might now not be reverted?
>
> Yes.
>
>> Commits:
>> 9c4b2867ed7c8c8784dd417ffd16e705e81eb145
>> a9ceb78bc75ca47972096372ff3d48648b16317a
>>
>> Recall that increased energy consumption was also isolated to those
>> two commits for some types of workflow. The suggestion is that the
>> commits should be reverted anyhow.
>>
>> What seems to be happening is that CPUs are deciding to stay in idle
>> state 0 a lot more, when they actually could / should be in a deeper
>> idle state.
>>
>> Test time is always: 33 minutes and 20 seconds test (8 CPUs), and there
>> is never any noticeable difference in execution times for the work:
>>
>> Example 1: kernel 4.5-rc5 with the rjw 3 patch set version 10
>> "Replace timers with utilization update callbacks":
>> Aggregate times in each idle state:
>
> You need to say what workload that is too.

Sorry, it had been in other e-mails; anyway, I do this
(compile the kernel a few times):

#!/bin/dash

# compile_9 Smythies 2016.03.02
#       Add some idle stats.
#
# compile_9 Smythies 2016.02.28
#       Do 9 incremental compiles.
#       Just trying to be consistent between kernels.
#

cat /sys/devices/system/cpu/cpu*/cpuidle/*/time > ~/temp-doug/begin.txt

LOOPS=0
while [ $LOOPS -lt 9 ];
do
  time make -j9 olddefconfig bindeb-pkg LOCALVERSION=-test
  sleep 1
  LOOPS=$((LOOPS+1))
done

cat /sys/devices/system/cpu/cpu*/cpuidle/*/time > ~/temp-doug/end.txt

_________________________
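The begin.txt/end.txt snapshots above hold per-(CPU, state) cumulative times in microseconds, one line each, in glob order. A sketch of the post-processing that would produce the per-state minutes in the tables (assuming 5 idle states per CPU, as in the tables; this helper is mine, not Doug's actual script):

```shell
#!/bin/sh
# Sum the time delta for each idle state across all CPUs and convert
# microseconds to minutes. Arguments: two snapshots produced by
#   cat /sys/devices/system/cpu/cpu*/cpuidle/*/time
aggregate_idle_minutes() {
    paste "$1" "$2" | awk '{
        d[(NR - 1) % 5] += $2 - $1   # states cycle 0..4 for each CPU
    } END {
        for (s = 0; s < 5; s++)
            printf "state %d: %.2f minutes\n", s, d[s] / 60000000
    }'
}
```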

No files have been touched since a previous compile.
The first compile takes about 8 or 9 minutes, and the rest
take about 2 minutes and 45 seconds, due to caching.
Total about 30 or 31 minutes.

Note that a clean compile takes about 20 minutes, but
that is not what I am doing.

Meanwhile, the following command is being run on
another SSH session (for 33 minutes and 20 seconds):

sudo turbostat -S -J --debug sleep 2000

>> rjwv10 (minutes)        reverted (minutes)      Idle State
>> 27.0654211              2.617325533                     0
>> 12.92451315             24.53266672                     1
>> 3.668558467             3.780482633                     2
>> 1.2727832               1.61352195                      3
>> 130.8342596             141.2234947                     4
>>
>> 175.7655355             173.7674915                     total
>>
>> Example 2: kernel 4.5-rc7 stock:
>> Aggregate times in each idle state:
>>
>> k45rc7 (minutes)        reverted (minutes)      Idle State
>> 20.1771917              2.638311483                     0
>> 13.02770225             21.81474838                     1
>> 3.428136783             3.951405                        2
>> 1.4540243               1.552488167                     3
>> 134.9057413             143.5533                        4
>>
>> 172.9927963             173.5102531                     total
>>
>> Energy (restated from a previous e-mail):
>>
>> Test 7: reverted: Package Joules: 47830
>> Test 8: rjwv10: Package Joules: 54419 (revert saves 12.1% energy)
>>
>> Test 9: reverted: Package Joules: 49326
>> Test 10: rjwv10: Package Joules: 55442 (revert saves 11% energy)
>>
>> Test 11: reverted: acpi-cpufreq ondemand: Package Joules: 49146
>> Test 12: rjwv10: acpi-cpufreq ondemand: Package Joules: 56302 (revert saves 12.7% energy)
>>
>> Energy (not in any previous e-mail):
>>
>> Reverted: 56178 Joules
>> Kernel 4.5-rc7: 63269 Joules (revert saves 12.6% energy)

> So the claim in commit a9ceb78bc75ca47972096372ff3d48648b16317a is
> that the change should not really affect systems with low C1
> latencies.
>
> I wonder what's the C1 latency on your test system?

From the script in a Rik van Riel e-mail:

state 0 latency: 0
state 1 latency: 2
state 2 latency: 10
state 3 latency: 80
state 4 latency: 104
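The script referenced above is not included in this thread; a minimal equivalent that produces the same output format from sysfs (the base directory is a parameter so the function can be pointed at a copy of the tree, defaulting to CPU0's real one):

```shell
#!/bin/sh
# Print the advertised exit latency (microseconds) of each cpuidle state.
print_cstate_latencies() {
    base="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for d in "$base"/state*; do
        [ -f "$d/latency" ] || continue
        n=${d##*/state}   # state directory index, e.g. "0" for state0
        printf 'state %s latency: %s\n' "$n" "$(cat "$d/latency")"
    done
}
```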




* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-09 23:45           ` Doug Smythies
@ 2016-03-09 23:59             ` Rafael J. Wysocki
  2016-03-11 14:03               ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-09 23:59 UTC (permalink / raw)
  To: Doug Smythies, Rik van Riel
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

On Thu, Mar 10, 2016 at 12:45 AM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2016.03.09 15:18 Rafael J. Wysocki wrote:
>> On Thu, Mar 10, 2016 at 12:03 AM, Doug Smythies <dsmythies@telus.net> wrote:
>>> Does the direction that this thread has taken mean that those two
>>> commits that Arto identified might now not be reverted?
>>
>> Yes.
>>
>>> Commits:
>>> 9c4b2867ed7c8c8784dd417ffd16e705e81eb145
>>> a9ceb78bc75ca47972096372ff3d48648b16317a
>>>
>>> Recall that increased energy consumption was also isolated to those
>>> two commits for some types of workflow. The suggestion is that the
>>> commits should be reverted anyhow.
>>>
>>> What seems to be happening is that CPUs are deciding to stay in idle
>>> state 0 a lot more, when they actually could / should be in a deeper
>>> idle state.
>>>
>>> Test time is always: 33 minutes and 20 seconds test (8 CPUs), and there
>>> is never any noticeable difference in execution times for the work:
>>>
>>> Example 1: kernel 4.5-rc5 with the rjw 3 patch set version 10
>>> "Replace timers with utilization update callbacks":
>>> Aggregate times in each idle state:
>>
>> You need to say what workload that is too.
>
> Sorry, it had been on other e-mails, anyway I do this
> (compile the kernel, a few times):
>
> #!/bin/dash
>
> # compile_9 Smythies 2016.03.02
> #       Add some idle stats.
> #
> # compile_9 Smythies 2016.02.28
> #       Do 9 incremental compiles.
> #       Just trying to be consistent between kernels.
> #
>
> cat /sys/devices/system/cpu/cpu*/cpuidle/*/time > ~/temp-doug/begin.txt
>
> LOOPS=0
> while [ $LOOPS -lt 9 ];
> do
>   time make -j9 olddefconfig bindeb-pkg LOCALVERSION=-test
>   sleep 1
>   LOOPS=$((LOOPS+1))
> done
>
> cat /sys/devices/system/cpu/cpu*/cpuidle/*/time > ~/temp-doug/end.txt
>
> _________________________
>
> No files have been touched since a previous compile.
> The first compile takes about 8 or 9 minutes, and the rest
> take about 2 minutes and 45 seconds, due to caching.
> Total about 30 or 31 minutes.
>
> Note that a clean compile takes about 20 minutes, but
> that is not what I am doing.
>
> Meanwhile, the following command is being run on
> another SSH session (for 33 minutes and 20 seconds):
>
> sudo turbostat -S -J --debug sleep 2000
>
>>> rjwv10 (minutes)        reverted (minutes)      Idle State
>>> 27.0654211              2.617325533                     0
>>> 12.92451315             24.53266672                     1
>>> 3.668558467             3.780482633                     2
>>> 1.2727832               1.61352195                      3
>>> 130.8342596             141.2234947                     4
>>>
>>> 175.7655355             173.7674915                     total
>>>
>>> Example 2: kernel 4.5-rc7 stock:
>>> Aggregate times in each idle state:
>>>
>>> k45rc7 (minutes)        reverted (minutes)      Idle State
>>> 20.1771917              2.638311483                     0
>>> 13.02770225             21.81474838                     1
>>> 3.428136783             3.951405                        2
>>> 1.4540243               1.552488167                     3
>>> 134.9057413             143.5533                        4
>>>
>>> 172.9927963             173.5102531                     total
>>>
>>> Energy (restated from a previous e-mail):
>>>
>>> Test 7: reverted: Package Joules: 47830
>>> Test 8: rjwv10: Package Joules: 54419 (revert saves 12.1% energy)
>>>
>>> Test 9: reverted: Package Joules: 49326
>>> Test 10: rjwv10: Package Joules: 55442 (revert saves 11% energy)
>>>
>>> Test 11: reverted: acpi-cpufreq ondemand: Package Joules: 49146
>>> Test 12: rjwv10: acpi-cpufreq ondemand: Package Joules: 56302 (revert saves 12.7% energy)
>>>
>>> Energy (not in any previous e-mail):
>>>
>>> Reverted: 56178 Joules
>>> Kernel 4.5-rc7: 63269 Joules (revert saves 12.6% energy)
>
>> So the claim in commit a9ceb78bc75ca47972096372ff3d48648b16317a is
>> that the change should not really affect systems with low C1
>> latencies.
>>
>> I wonder what's the C1 latency on your test system?
>
> From that script from a Rik van Riel e-mail:
>
> state 0 latency: 0
> state 1 latency: 2
> state 2 latency: 10
> state 3 latency: 80
> state 4 latency: 104

OK, thanks.

Rik, that seems to go against the changelog of
a9ceb78bc75ca47972096372ff3d48648b16317a:

"This is not a big deal on most x86 CPUs, which have very low C1
latencies, and the patch should not have any effect on those CPUs."

The effect is actually measurable and quite substantial to my eyes.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-09 23:59             ` Rafael J. Wysocki
@ 2016-03-11 14:03               ` Rik van Riel
  2016-03-11 18:22                 ` Doug Smythies
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-11 14:03 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, Rafael J. Wysocki, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

On Thu, 10 Mar 2016 00:59:01 +0100
"Rafael J. Wysocki" <rafael@kernel.org> wrote:

> OK, thanks.
> 
> Rik, that seems to go against the changelog of
> a9ceb78bc75ca47972096372ff3d48648b16317a:
> 
> "This is not a big deal on most x86 CPUs, which have very low C1
> latencies, and the patch should not have any effect on those CPUs."
> 
> The effect is actually measurable and quite substantial to my eyes.

Indeed, my mistake was testing not against the predicted sleep
time itself, but against the predicted sleep time divided by the
load correction factor, which can shrink it by 10x or more under
load...

The patch below should fix that.

It didn't for Arto, due to the other issues on his system, but
it might resolve the issue for Doug, where cstate/pstate is
otherwise working fine.

Doug, does the patch below solve your issue?

If it does not, we should figure out why the idle state selection
loop is not selecting the right mode.

Is the latency_req "load correction" too aggressive?

Or is it only too aggressive for IDLE->HLT selection, and fine to
drive choices between deeper C states?

After all, if it causes the IDLE->HLT selection to go wrong, maybe
it is also causing us to pick shallower C states when we should be
picking deeper ones?

        /*
         * Find the idle state with the lowest power while satisfying
         * our constraints.
         */
        for (i = data->last_state_idx + 1; i < drv->state_count; i++) {
                struct cpuidle_state *s = &drv->states[i];
                struct cpuidle_state_usage *su = &dev->states_usage[i];

                if (s->disabled || su->disable)
                        continue;
                if (s->target_residency > data->predicted_us)
                        continue;
                if (s->exit_latency > latency_req)
                        continue;

                data->last_state_idx = i;
        }

---8<---

Subject: cpuidle: use predicted_us not interactivity_req to consider polling

The interactivity_req variable is the expected sleep time, divided
by the CPU load. This can be too aggressive a factor in deciding
whether or not to consider polling in the cpuidle state selection.

Use the (not corrected for load) predicted_us instead.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 drivers/cpuidle/governors/menu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 0742b3296673..97022ae01d2e 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -330,7 +330,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 		 * We want to default to C1 (hlt), not to busy polling
 		 * unless the timer is happening really really soon.
 		 */
-		if (interactivity_req > 20 &&
+		if (data->predicted_us > 20 &&
 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
 			dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
 			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;


* RE: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-11 14:03               ` Rik van Riel
@ 2016-03-11 18:22                 ` Doug Smythies
  2016-03-11 20:30                   ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Doug Smythies @ 2016-03-11 18:22 UTC (permalink / raw)
  To: 'Rik van Riel', 'Rafael J. Wysocki'
  Cc: 'Rafael J. Wysocki', 'Viresh Kumar',
	'Srinivas Pandruvada', 'Chen, Yu C',
	linux-pm, 'Arto Jantunen', 'Len Brown'

On 2016.03.11 06:03 Rik van Riel wrote:
> On Thu, 10 Mar 2016 00:59:01 +0100 "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
>> Rik, that seems to go against the changelog of
>> a9ceb78bc75ca47972096372ff3d48648b16317a:
>> 
>> "This is not a big deal on most x86 CPUs, which have very low C1
>> latencies, and the patch should not have any effect on those CPUs."
>> 
>> The effect is actually measurable and quite substantial to my eyes.
>
> Indeed, my mistake was testing not against the predicted sleep
> time itself, but against the predicted sleep time divided by the
> load correction factor, which can shrink it by 10x or more under
> load...
>
> The patch below should fix that.
>
> It didn't for Arto, due to the other issues on his system, but
> it might resolve the issue for Doug, where cstate/pstate is
> otherwise working fine.
>
> Doug, does the patch below solve your issue?

No.

Old data restated with new data added below:
Aggregate times in each idle state for the 2000 second test:

k45rc7 (minutes)   reverted (mins)   rvr patch (mins)   State
20.1771917         2.638311483       19.11342298        0
13.02770225        21.81474838       13.55643397        1
3.428136783        3.951405          3.698494867        2
1.4540243          1.552488167       1.528558717        3
134.9057413        143.5533          138.5279812        4

172.9927963        173.5102531       176.4248918        total

>> Energy:
>>
>> Reverted: 56178 Joules
>> Kernel 4.5-rc7: 63269 Joules (revert saves 12.6% energy)
Kernel 4.5-rc7 + rvr patch: 62914 Joules

> If it does not, we should figure out why the idle state selection
> loop is not selecting the right mode.

For my part, I am struggling to understand this area of the code.
It would take me a while, quite a while, to be able to provide useful
input.

... Doug




* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-11 18:22                 ` Doug Smythies
@ 2016-03-11 20:30                   ` Rik van Riel
  2016-03-11 23:54                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-11 20:30 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael J. Wysocki', 'Rafael J. Wysocki',
	'Viresh Kumar', 'Srinivas Pandruvada',
	'Chen, Yu C', linux-pm, 'Arto Jantunen',
	'Len Brown'

On Fri, 11 Mar 2016 10:22:30 -0800
"Doug Smythies" <dsmythies@telus.net> wrote:

> On 2016.03.11 06:03 Rik van Riel wrote:
> > On Thu, 10 Mar 2016 00:59:01 +0100 "Rafael J. Wysocki" <rafael@kernel.org> wrote:

> > The patch below should fix that.
> >
> > It didn't for Arto, due to the other issues on his system, but
> > it might resolve the issue for Doug, where cstate/pstate is
> > otherwise working fine.
> >
> > Doug, does the patch below solve your issue?
> 
> No.

OK, let's try doing this check a little more aggressively, since there
almost certainly are good reasons why the main selection loop in
menu_select() so aggressively tries to stay with shallower C states.

With the patch below, we compare the (not corrected for load) expected
sleep time against the target residency of the C1 (hlt) state on the CPU.

This might resolve the issue, while still doing the right thing on CPUs
that have really high C1 latencies (eg. Atom).

Does this resolve the issue for you?

---8<---

Subject: cpuidle: use predicted_us not interactivity_req to consider polling

The interactivity_req variable is the expected sleep time, divided
by the CPU load. This can be too aggressive a factor in deciding
whether or not to consider polling in the cpuidle state selection.

Use the (not corrected for load) predicted_us instead, and compare
it against the target residency of C1 (hlt).

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 drivers/cpuidle/governors/menu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 0742b3296673..7dda15cfe547 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -330,7 +330,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 		 * We want to default to C1 (hlt), not to busy polling
 		 * unless the timer is happening really really soon.
 		 */
-		if (interactivity_req > 20 &&
+		if (data->predicted_us > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency &&
 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
 			dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
 			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;



* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-11 20:30                   ` Rik van Riel
@ 2016-03-11 23:54                     ` Rafael J. Wysocki
  2016-03-12  0:46                       ` Doug Smythies
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-11 23:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Doug Smythies, Rafael J. Wysocki, Rafael J. Wysocki,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Fri, Mar 11, 2016 at 9:30 PM, Rik van Riel <riel@redhat.com> wrote:
> On Fri, 11 Mar 2016 10:22:30 -0800
> "Doug Smythies" <dsmythies@telus.net> wrote:
>
>> On 2016.03.11 06:03 Rik van Riel wrote:
>> > On Thu, 10 Mar 2016 00:59:01 +0100 "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
>> > The patch below should fix that.
>> >
>> > It didn't for Arto, due to the other issues on his system, but
>> > it might resolve the issue for Doug, where cstate/pstate is
>> > otherwise working fine.
>> >
>> > Doug, does the patch below solve your issue?
>>
>> No.
>
> OK, let's try doing this check a little more aggressively, since there
> almost certainly are good reasons why the main selection loop in
> menu_select() so aggressively tries to stay with shallower C states.
>
> With the patch below, we compare the (not corrected for load) expected
> sleep time against the target residency of the C1 (hlt) state on the CPU.
>
> This might resolve the issue, while still doing the right thing on CPUs
> that have really high C1 latencies (eg. Atom).
>
> Does this resolve the issue for you?
>
> ---8<---
>
> Subject: cpuidle: use predicted_us not interactivity_req to consider polling
>
> The interactivity_req variable is the expected sleep time, divided
> by the CPU load. This can be too aggressive a factor in deciding
> whether or not to consider polling in the cpuidle state selection.
>
> Use the (not corrected for load) predicted_us instead, and compare
> it against the target residency of C1 (hlt).
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  drivers/cpuidle/governors/menu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 0742b3296673..7dda15cfe547 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -330,7 +330,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>                  * We want to default to C1 (hlt), not to busy polling
>                  * unless the timer is happening really really soon.
>                  */
> -               if (interactivity_req > 20 &&
> +               if (data->predicted_us > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency &&

If we do this, we probably should check the exit latency (against
latency_req) too.

>                     !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
>                         dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
>                         data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
>


* RE: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-11 23:54                     ` Rafael J. Wysocki
@ 2016-03-12  0:46                       ` Doug Smythies
  2016-03-12  1:45                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Doug Smythies @ 2016-03-12  0:46 UTC (permalink / raw)
  To: 'Rafael J. Wysocki', 'Rik van Riel'
  Cc: 'Rafael J. Wysocki', 'Viresh Kumar',
	'Srinivas Pandruvada', 'Chen, Yu C',
	linux-pm, 'Arto Jantunen', 'Len Brown'

On 2016.03.11 15:55 Rafael J. Wysocki wrote:
> On Fri, Mar 11, 2016 at 9:30 PM, Rik van Riel <riel@redhat.com> wrote:
>> On Fri, 11 Mar 2016 10:22:30 -0800 Doug Smythies wrote:
>>> On 2016.03.11 06:03 Rik van Riel wrote:
>>
>>>> The patch below should fix that.
>>>>
>>>> It didn't for Arto, due to the other issues on his system, but
>>>> it might resolve the issue for Doug, where cstate/pstate is
>>>> otherwise working fine.
>>>>
>>>> Doug, does the patch below solve your issue?
>>>
>>> No.
>>
>> OK, let's try doing this check a little more aggressively, since there
>> almost certainly are good reasons why the main selection loop in
>> menu_select() so aggressively tries to stay with shallower C states.
>>
>> With the patch below, we compare the (not corrected for load) expected
>> sleep time against the target residency of the C1 (hlt) state on the CPU.
>>
>> This might resolve the issue, while still doing the right thing on CPUs
>> that have really high C1 latencies (eg. Atom).
>>
>> Does this resolve the issue for you?

No, but it seems better.

Old data restated with new data added below:
Aggregate times in each idle state for the 2000 second test:

State   k45rc7 (mins)   reverted (mins)   rvr (mins)   rvr2 (mins)
0       20.18           2.64              19.11        14.09
1       13.03           21.81             13.56        12.05
2       3.43            3.95              3.70         3.93
3       1.45            1.55              1.53         1.52
4       134.91          143.55            138.53       142.80

total   172.99          173.51            176.42       174.37

Energy (old restated, plus new added):

Kernel 4.5-rc7 Reverted: 56178 Joules
Kernel 4.5-rc7: 63269 Joules (revert saves 12.6% energy)
Kernel 4.5-rc7 + rvr patch: 62914 Joules
Kernel 4.5-rc7 + rvr patch version 2: 60416 Joules

> If we do this, we probably should check the exit latency
> (against latency_req) too.

I already had the above data before I noticed Rafael's e-mail.

... Doug




* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-12  0:46                       ` Doug Smythies
@ 2016-03-12  1:45                         ` Rafael J. Wysocki
  2016-03-12  2:02                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-12  1:45 UTC (permalink / raw)
  To: Doug Smythies, 'Rik van Riel'
  Cc: 'Rafael J. Wysocki', 'Viresh Kumar',
	'Srinivas Pandruvada', 'Chen, Yu C',
	linux-pm, 'Arto Jantunen', 'Len Brown'

On Friday, March 11, 2016 04:46:53 PM Doug Smythies wrote:
> On 2016.03.11 15:55 Rafael J. Wysocki wrote:
> > On Fri, Mar 11, 2016 at 9:30 PM, Rik van Riel <riel@redhat.com> wrote:
> >> On Fri, 11 Mar 2016 10:22:30 -0800 Doug Smythies wrote:
> >>> On 2016.03.11 06:03 Rik van Riel wrote:
> >>
> >>>> The patch below should fix that.
> >>>>
> >>>> It didn't for Arto, due to the other issues on his system, but
> >>>> it might resolve the issue for Doug, where cstate/pstate is
> >>>> otherwise working fine.
> >>>>
> >>>> Doug, does the patch below solve your issue?
> >>>
> >>> No.
> >>
> >> OK, let's try doing this check a little more aggressively, since there
> >> almost certainly are good reasons why the main selection loop in
> >> menu_select() so aggressively tries to stay with shallower C states.
> >>
> >> With the patch below, we compare the (not corrected for load) expected
> >> sleep time against the target residency of the C1 (hlt) state on the CPU.
> >>
> >> This might resolve the issue, while still doing the right thing on CPUs
> >> that have really high C1 latencies (eg. Atom).
> >>
> >> Does this resolve the issue for you?
> 
> No, but it seems better.
> 
> Old data restated with new data added below:
> Aggregate times in each idle state for the 2000 second test:
> 
> State   k45rc7 (mins)   reverted (mins)   rvr (mins)   rvr2 (mins)
> 0       20.18           2.64              19.11        14.09
> 1       13.03           21.81             13.56        12.05

The difference is still significant, though.

> 2       3.43            3.95              3.70         3.93
> 3       1.45            1.55              1.53         1.52
> 4       134.91          143.55            138.53       142.80
>
> total   172.99          173.51            176.42       174.37
> 
> Energy (old restated, plus new added):
> 
> Kernel 4.5-rc7 Reverted: 56178 Joules
> Kernel 4.5-rc7: 63269 Joules (revert saves 12.6% energy)
> Kernel 4.5-rc7 + rvr patch: 62914 Joules
> Kernel 4.5-rc7 + rvr patch version 2: 60416 Joules
> 
> > If we do this, we probably should check the exit latency
> > (against latency_req) too.
> 
> I already had the above data before I noticed Rafael's e-mail.

So having considered that for a while longer, I think that Rik's original
commit tried to address a real problem, but in a way that wasn't really
adequate.

The original problem appears to be that on some systems the exit latency
of C1 (not the target residency, mind you) is too high and that causes
performance issues if we aggressively choose C1 instead of polling on
them.  So to address that directly, why don't we simply check the
exit latency against latency_req?

Something like the patch below, maybe?

---
 drivers/cpuidle/governors/menu.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -329,7 +329,8 @@ static int menu_select(struct cpuidle_dr
 		 * We want to default to C1 (hlt), not to busy polling
 		 * unless the timer is happening really really soon.
 		 */
-		if (interactivity_req > 20 &&
+		if (data->next_timer_us > 20 &&
+		    !drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency > latency_req &&
 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
 			dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
 			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;



* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-12  1:45                         ` Rafael J. Wysocki
@ 2016-03-12  2:02                           ` Rafael J. Wysocki
  2016-03-13  7:46                             ` Doug Smythies
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-12  2:02 UTC (permalink / raw)
  To: Doug Smythies, 'Rik van Riel'
  Cc: 'Rafael J. Wysocki', 'Viresh Kumar',
	'Srinivas Pandruvada', 'Chen, Yu C',
	linux-pm, 'Arto Jantunen', 'Len Brown'

On Saturday, March 12, 2016 02:45:42 AM Rafael J. Wysocki wrote:
> On Friday, March 11, 2016 04:46:53 PM Doug Smythies wrote:
> > On 2016.03.11 15:55 Rafael J. Wysocki wrote:
> > > On Fri, Mar 11, 2016 at 9:30 PM, Rik van Riel <riel@redhat.com> wrote:
> > >> On Fri, 11 Mar 2016 10:22:30 -0800 Doug Smythies wrote:
> > >>> On 2016.03.11 06:03 Rik van Riel wrote:
> > >>
> > >>>> The patch below should fix that.
> > >>>>
> > >>>> It didn't for Arto, due to the other issues on his system, but
> > >>>> it might resolve the issue for Doug, where cstate/pstate is
> > >>>> otherwise working fine.
> > >>>>
> > >>>> Doug, does the patch below solve your issue?
> > >>>
> > >>> No.
> > >>
> > >> OK, let's try doing this check a little more aggressively, since there
> > >> almost certainly are good reasons why the main selection loop in
> > >> menu_select() so aggressively tries to stay with shallower C states.
> > >>
> > >> With the patch below, we compare the (not corrected for load) expected
> > >> sleep time against the target residency of the C1 (hlt) state on the CPU.
> > >>
> > >> This might resolve the issue, while still doing the right thing on CPUs
> > >> that have really high C1 latencies (eg. Atom).
> > >>
> > >> Does this resolve the issue for you?
> > 
> > No, but it seems better.
> > 
> > Old data restated with new data added below:
> > Aggregate times in each idle state for the 2000 second test:
> > 
> > State   k45rc7 (mins)   reverted (mins)   rvr (mins)   rvr2 (mins)
> > 0       20.18           2.64              19.11        14.09
> > 1       13.03           21.81             13.56        12.05
> 
> The difference is still significant, though.
> 
> > 2       3.43            3.95              3.70         3.93
> > 3       1.45            1.55              1.53         1.52
> > 4       134.91          143.55            138.53       142.80
> >
> > total   172.99          173.51            176.42       174.37
> > 
> > Energy (old restated, plus new added):
> > 
> > Kernel 4.5-rc7 Reverted: 56178 Joules
> > Kernel 4.5-rc7: 63269 Joules (revert saves 12.6% energy)
> > Kernel 4.5-rc7 + rvr patch: 62914 Joules
> > Kernel 4.5-rc7 + rvr patch version 2: 60416 Joules
> > 
> > > If we do this, we probably should check the exit latency
> > > (against latency_req) too.
> > 
> > I already had the above data before I noticed Rafael's e-mail.
> 
> So having considered that for a while longer, I think that Rik's original
> commit tried to address a real problem, but in a way that wasn't really
> adequate.
> 
> The original problem appears to be that on some systems the exit latency
> of C1 (not the target residency, mind you) is too high and that causes
> performance issues if we aggressively choose C1 instead of polling on
> them.  So to address that directly, why don't we simply check the
> exit latency against latency_req?
> 
> Something like the patch below, maybe?
> 
> ---
>  drivers/cpuidle/governors/menu.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> Index: linux-pm/drivers/cpuidle/governors/menu.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/governors/menu.c
> +++ linux-pm/drivers/cpuidle/governors/menu.c
> @@ -329,7 +329,8 @@ static int menu_select(struct cpuidle_dr
>  		 * We want to default to C1 (hlt), not to busy polling
>  		 * unless the timer is happening really really soon.
>  		 */
> -		if (interactivity_req > 20 &&
> +		if (data->next_timer_us > 20 &&
> +		    !drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency > latency_req &&

Gosh, I'm too tired.  Parens missing and it can be written simpler using <=.

Tentatively-signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/menu.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -327,11 +327,13 @@ static int menu_select(struct cpuidle_dr
 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
 		/*
 		 * We want to default to C1 (hlt), not to busy polling
-		 * unless the timer is happening really really soon.
+		 * unless the timer is happening really really soon.  Still, if
+		 * the exit latency of C1 is too high, we need to poll anyway.
 		 */
-		if (interactivity_req > 20 &&
+		if (data->next_timer_us > 20 &&
+		    drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency <= latency_req &&
 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
-			dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
+		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)
 			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
 	} else {
 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START;



* RE: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-12  2:02                           ` Rafael J. Wysocki
@ 2016-03-13  7:46                             ` Doug Smythies
  2016-03-14  1:32                               ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Doug Smythies @ 2016-03-13  7:46 UTC (permalink / raw)
  To: 'Rafael J. Wysocki', 'Rik van Riel'
  Cc: 'Rafael J. Wysocki', 'Viresh Kumar',
	'Srinivas Pandruvada', 'Chen, Yu C',
	linux-pm, 'Arto Jantunen', 'Len Brown'

On 2016.03.11 18:02 Rafael J. Wysocki wrote:
> On Saturday, March 12, 2016 02:45:42 AM Rafael J. Wysocki wrote:
>
> Gosh, I'm too tired.  Parens missing and it can be written simpler using <=.
>
> Tentatively-signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> drivers/cpuidle/governors/menu.c |    8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> Index: linux-pm/drivers/cpuidle/governors/menu.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/governors/menu.c
> +++ linux-pm/drivers/cpuidle/governors/menu.c
> @@ -327,11 +327,13 @@ static int menu_select(struct cpuidle_dr
> 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
> 		/*
> 		 * We want to default to C1 (hlt), not to busy polling
> -		 * unless the timer is happening really really soon.
> +		 * unless the timer is happening really really soon.  Still, if
> +		 * the exit latency of C1 is too high, we need to poll anyway.
> 		 */
> -		if (interactivity_req > 20 &&
> +		if (data->next_timer_us > 20 &&
> +		    drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency <= latency_req &&
> 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
> -			dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
> +		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)
> 			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
> 	} else {
> 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START;

Note 1: Above = rvr3 (because I already have a bunch of rjw kernels for other stuff).
Note 2: reference tests re-done, using Rafael's 3 patch set version 10
"cpufreq: Replace timers with utilization update callbacks".
Why? Because it was desirable to eliminate intel_pstate long durations
between calls that were due to the CPU being idle on jiffy boundaries,
but otherwise busy.
Well why was that desirable? So that trace could be acquired where
we could be reasonably confident that most very high CPU loads combined
with very long durations were due to long periods in idle state 0.

Aggregate times in each idle state for the 2000 second test:
State   k45rc7-rjw10 (mins)   k45rc7-rjw10-reverted (mins)   k45rc7-rjw10-rvr3 (mins)
0       18.07                 0.92                           18.38
1       12.35                 19.51                          13.16
2       3.96                  4.28                           2.91
3       1.55                  1.53                           1.00
4       138.96                141.99                         115.41

total   174.90                168.24                         150.87

Energy:
Kernel 4.5-rc7-rjw10: 61983 Joules
Kernel 4.5-rc7-rjw10-reverted: 48409 Joules
Kernel 4.5-rc7-rjw10-rvr3: 62938 Joules 

Isn't the issue here simply that it can be very expensive, in terms
of energy, when the decision is made to poll instead of using HLT or
deeper? It doesn't have to happen all that often: once the CPU is
basically abandoned in that state, it can stay there for up to 200,000
times longer than was expected (4 seconds instead of <20 usecs), per
occurrence.

An intel_pstate trace was obtained for the above "k45rc7-rjw10-rvr3" (Kernel
4.5-rc7 with Rafael's 3 patch set version 10 and the above suggested patch).
In 2000 seconds there were about 3164 long durations at high CPU load (in this
context meaning the CPU was actually idle, but was in idle state 0) accounting
for 17.15 minutes out of the above listed 18.38 minutes. For example:

CPU 6: mperf: 6672329686; aperf: 6921452881; load: 99.83% duration: 1.96 seconds.
CPU 5: mperf: 7591407713; aperf: 5651758618; load: 99.87% duration: 2.23 seconds.

An intel_pstate trace was obtained for the above "k45rc7-rjw10-reverted" (Kernel
4.5-rc7 with Rafael's 3 patch set version 10 and commits
9c4b2867ed7c8c8784dd417ffd16e705e81eb145 and
a9ceb78bc75ca47972096372ff3d48648b16317a reverted).
In 2000 seconds there were about 237 long durations at high CPU load (in this
context meaning the CPU was actually idle, but was in idle state 0)
totaling 3.42 minutes, or more than can be accounted for above.
However, if I compensate for actual load (which is consistently lower in the
237 samples, meaning the CPU wasn't actually always in state 0 during that
time) and take out some of the watchdog limit hits at the end (the trace ran
longer than the actual idle-state collection time), it drops to 0.35 minutes.

... Doug
 



* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-13  7:46                             ` Doug Smythies
@ 2016-03-14  1:32                               ` Rafael J. Wysocki
  2016-03-14  6:39                                 ` Doug Smythies
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-14  1:32 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rik van Riel', 'Rafael J. Wysocki',
	'Viresh Kumar', 'Srinivas Pandruvada',
	'Chen, Yu C', linux-pm, 'Arto Jantunen',
	'Len Brown'

On Saturday, March 12, 2016 11:46:03 PM Doug Smythies wrote:
> On 2016.03.11 18:02 Rafael J. Wysocki wrote:
> > On Saturday, March 12, 2016 02:45:42 AM Rafael J. Wysocki wrote:
> >
> > Gosh, I'm too tired.  Parens missing and it can be written simpler using <=.
> >
> > Tentatively-signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> > drivers/cpuidle/governors/menu.c |    8 +++++---
> > 1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > Index: linux-pm/drivers/cpuidle/governors/menu.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpuidle/governors/menu.c
> > +++ linux-pm/drivers/cpuidle/governors/menu.c
> > @@ -327,11 +327,13 @@ static int menu_select(struct cpuidle_dr
> > 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
> > 		/*
> > 		 * We want to default to C1 (hlt), not to busy polling
> > -		 * unless the timer is happening really really soon.
> > +		 * unless the timer is happening really really soon.  Still, if
> > +		 * the exit latency of C1 is too high, we need to poll anyway.
> > 		 */
> > -		if (interactivity_req > 20 &&
> > +		if (data->next_timer_us > 20 &&
> > +		    drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency <= latency_req &&
> > 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
> > -			dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
> > +		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)
> > 			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
> > 	} else {
> > 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
> 
> Note 1: Above = rvr3 (because I already have a bunch of rjw kernels for other stuff).
> Note 2: reference tests re-done, using Rafael's 3 patch set version 10
> "cpufreq: Replace timers with utilization update callbacks".
> Why? Because it was desirable to eliminate intel_pstate long durations
> between calls that were due to the CPU being idle on jiffy boundaries,
> but otherwise busy.
> Well why was that desirable? So that trace could be acquired where
> we could be reasonably confident that most very high CPU loads combined
> with very long durations were due to long periods in idle state 0.
> 
> Aggregate times in each idle state for the 2000 second test:
> State   k45rc7-rjw10 (mins)   k45rc7-rjw10-reverted (mins)   k45rc7-rjw10-rvr3 (mins)
> 0       18.07                 0.92                           18.38
> 1       12.35                 19.51                          13.16
> 2       3.96                  4.28                           2.91
> 3       1.55                  1.53                           1.00
> 4       138.96                141.99                         115.41
>
> total   174.90                168.24                         150.87
> 
> Energy:
> Kernel 4.5-rc7-rjw10: 61983 Joules
> Kernel 4.5-rc7-rjw10-reverted: 48409 Joules
> Kernel 4.5-rc7-rjw10-rvr3: 62938 Joules 

OK, thanks for the report.

This seems to mean that our latency_req is just off for short latencies (I guess
it simply is 0, because the performance multiplier is likely to be greater than
predicted_us for small values of that).

The patch below should restore the original behavior for you.  Can you please
test it?

---
 drivers/cpuidle/governors/menu.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -327,11 +327,13 @@ static int menu_select(struct cpuidle_dr
 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
 		/*
 		 * We want to default to C1 (hlt), not to busy polling
-		 * unless the timer is happening really really soon.
+		 * unless the timer is happening really really soon.  Still, if
+		 * the exit latency of C1 is too high, we need to poll anyway.
 		 */
-		if (interactivity_req > 20 &&
+		if (data->next_timer_us > 20 &&
+		    drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency <= 2 &&
 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
-			dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
+		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)
 			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
 	} else {
 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START;


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-14  1:32                               ` Rafael J. Wysocki
@ 2016-03-14  6:39                                 ` Doug Smythies
  2016-03-14 12:47                                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Doug Smythies @ 2016-03-14  6:39 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Rik van Riel', 'Rafael J. Wysocki',
	'Viresh Kumar', 'Srinivas Pandruvada',
	'Chen, Yu C', linux-pm, 'Arto Jantunen',
	'Len Brown'

On 2016.03.13 18:32 Rafael J. Wysocki wrote:
> On Saturday, March 12, 2016 11:46:03 PM Doug Smythies wrote:

>> Note 1: Above = rvr3 (because I already have a bunch of rjw kernels for other stuff).
>> Note 2: reference tests re-done, using Rafael's 3 patch set version 10
>> "cpufreq: Replace timers with utilization update callbacks".
>> Why? Because it was desirable to eliminate intel_pstate long durations
>> between calls that were due to the CPU being idle on jiffy boundaries,
>> but otherwise busy.
>> Well why was that desirable? So that trace could be acquired where
>> we could be reasonably confident that most very high CPU loads combined
>> with very long durations were due to long periods in idle state 0.
>> 
>> Aggregate times in each idle state for the 2000 second test:
>> State	k45rc7-rjw10 (mins)	k45rc7-rjw10-reverted (mins)	k45rc7-rjw10-rvr3 (mins)
>> 0.00	18.07				0.92					18.38
>> 1.00	12.35				19.51					13.16
>> 2.00	3.96				4.28					2.91
>> 3.00	1.55				1.53					1.00
>> 4.00	138.96			141.99				115.41
>> 			
>> total	174.90			168.24				150.87
>> 
>> Energy:
>> Kernel 4.5-rc7-rjw10: 61983 Joules
>> Kernel 4.5-rc7-rjw10-reverted: 48409 Joules
>> Kernel 4.5-rc7-rjw10-rvr3: 62938 Joules 
>
> OK, thanks for the report.
>
> This seems to mean that our latency_req is just off for short latencies (I guess
> it simply is 0, because the performance multiplier is likely to be greater than
> predicted_us for small values of that).
>
> The patch below should restore the original behavior for you.  Can you please
> test it?

Yes, it seems O.K. (reference rvr4 = below patch).

State	k45rc7-rjw10-rvr4 (mins)
0.00	1.79
1.00	21.82
2.00	4.26
3.00	1.64
4.00	148.77
			
total	178.29

Energy: 55743 Joules.
Note: there were other system changes today, and energy has variations anyhow.
For example another run of k45rc7-rjw10-rvr3 was 55040 Joules, but still 0.13
minutes in idle state 0.

Note: I broke something today and, due to some missing shared library,
trace does not work, so no trace was acquired.

> ---
> drivers/cpuidle/governors/menu.c |    8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> Index: linux-pm/drivers/cpuidle/governors/menu.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/governors/menu.c
> +++ linux-pm/drivers/cpuidle/governors/menu.c
> @@ -327,11 +327,13 @@ static int menu_select(struct cpuidle_dr
> 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
>  		/*
> 		 * We want to default to C1 (hlt), not to busy polling
> -		 * unless the timer is happening really really soon.
> +		 * unless the timer is happening really really soon.  Still, if
> +		 * the exit latency of C1 is too high, we need to poll anyway.
> 		 */
> -		if (interactivity_req > 20 &&
> +		if (data->next_timer_us > 20 &&
> +		    drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency <= 2 &&
> 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
> -			dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
> +		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)
>  			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
> 	} else {
>  		data->last_state_idx = CPUIDLE_DRIVER_STATE_START;



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-14  6:39                                 ` Doug Smythies
@ 2016-03-14 12:47                                   ` Rafael J. Wysocki
  2016-03-14 14:31                                     ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-14 12:47 UTC (permalink / raw)
  To: Doug Smythies, Rik van Riel
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

On Mon, Mar 14, 2016 at 7:39 AM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2016.03.13 18:32 Rafael J. Wysocki wrote:
>> On Saturday, March 12, 2016 11:46:03 PM Doug Smythies wrote:
>
>>> Note 1: Above = rvr3 (because I already have a bunch of rjw kernels for other stuff).
>>> Note 2: reference tests re-done, using Rafael's 3 patch set version 10
>>> "cpufreq: Replace timers with utilization update callbacks".
>>> Why? Because it was desirable to eliminate intel_pstate long durations
>>> between calls that were due to the CPU being idle on jiffy boundaries,
>>> but otherwise busy.
>>> Well why was that desirable? So that trace could be acquired where
>>> we could be reasonably confident that most very high CPU loads combined
>>> with very long durations were due to long periods in idle state 0.
>>>
>>> Aggregate times in each idle state for the 2000 second test:
>>> State	k45rc7-rjw10 (mins)	k45rc7-rjw10-reverted (mins)	k45rc7-rjw10-rvr3 (mins)
>>> 0.00	18.07	0.92	18.38
>>> 1.00	12.35	19.51	13.16
>>> 2.00	3.96	4.28	2.91
>>> 3.00	1.55	1.53	1.00
>>> 4.00	138.96	141.99	115.41
>>>
>>> total	174.90	168.24	150.87
>>>
>>> Energy:
>>> Kernel 4.5-rc7-rjw10: 61983 Joules
>>> Kernel 4.5-rc7-rjw10-reverted: 48409 Joules
>>> Kernel 4.5-rc7-rjw10-rvr3: 62938 Joules
>>
>> OK, thanks for the report.
>>
>> This seems to mean that our latency_req is just off for short latencies (I guess
>> it simply is 0, because the performance multiplier is likely to be greater than
>> predicted_us for small values of that).
>>
>> The patch below should restore the original behavior for you.  Can you please
>> test it?
>
> Yes, it seems O.K. (reference rvr4 = below patch).
>
> State   k45rc7-rjw10-rvr4 (mins)
> 0.00    1.79
> 1.00    21.82
> 2.00    4.26
> 3.00    1.64
> 4.00    148.77
>
> total   178.29
>
> Energy: 55743 Joules.

OK, thanks!

Rik, I think what we can do is to put a constant limit on the exit
latency of C1 like in this patch.  I'm not sure if 2 is good enough,
though.  It can't be less than 2, because that's what the majority of
systems have, but maybe it can be greater?  Any ideas?

> Note: there were other system changes today, and energy has variations anyhow.
> For example another run of k45rc7-rjw10-rvr3 was 55040 Joules, but still 0.13
> minutes in idle state 0.
>
> Note: I broke something today and, due to some missing shared library,
> trace does not work, so no trace was acquired.
>
>> ---
>> drivers/cpuidle/governors/menu.c |    8 +++++---
>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> Index: linux-pm/drivers/cpuidle/governors/menu.c
>> ===================================================================
>> --- linux-pm.orig/drivers/cpuidle/governors/menu.c
>> +++ linux-pm/drivers/cpuidle/governors/menu.c
>> @@ -327,11 +327,13 @@ static int menu_select(struct cpuidle_dr
>>               data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
>>               /*
>>                * We want to default to C1 (hlt), not to busy polling
>> -              * unless the timer is happening really really soon.
>> +              * unless the timer is happening really really soon.  Still, if
>> +              * the exit latency of C1 is too high, we need to poll anyway.
>>                */
>> -             if (interactivity_req > 20 &&
>> +             if (data->next_timer_us > 20 &&
>> +                 drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency <= 2 &&
>>                   !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
>> -                     dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
>> +                 !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)
>>                       data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
>>       } else {
>>               data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
>
>

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-14 12:47                                   ` Rafael J. Wysocki
@ 2016-03-14 14:31                                     ` Rik van Riel
  2016-03-14 15:21                                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-14 14:31 UTC (permalink / raw)
  To: Rafael J. Wysocki, Doug Smythies
  Cc: Rafael J. Wysocki, Viresh Kumar, Srinivas Pandruvada, Chen, Yu C,
	linux-pm, Arto Jantunen, Len Brown

[-- Attachment #1: Type: text/plain, Size: 707 bytes --]

On Mon, 2016-03-14 at 13:47 +0100, Rafael J. Wysocki wrote:
> 
> Rik, I think what we can do is to put a constant limit on the exit
> latency of C1 like in this patch.  I'm not sure if 2 is good enough,
> though.  It can't be less than 2, because that's what the majority of
> systems have, but maybe it can be greater?  Any ideas?

From Doug's description, it sounds like the problem is not
that we go into idle state 0 too often, but that sometimes
we stay there too long.

I wonder if it makes sense for idle state 0 to promote
itself to idle state 1 after some amount of time?

Maybe something like 2x the exit latency (or target
residency?) of C1?

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-14 14:31                                     ` Rik van Riel
@ 2016-03-14 15:21                                       ` Rafael J. Wysocki
  2016-03-14 17:45                                         ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-14 15:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Doug Smythies, Rafael J. Wysocki,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Mon, Mar 14, 2016 at 3:31 PM, Rik van Riel <riel@redhat.com> wrote:
> On Mon, 2016-03-14 at 13:47 +0100, Rafael J. Wysocki wrote:
>>
>> Rik, I think what we can do is to put a constant limit on the exit
>> latency of C1 like in this patch.  I'm not sure if 2 is good enough,
>> though.  It can't be less than 2, because that's what the majority of
>> systems have, but maybe it can be greater?  Any ideas?
>
> From Doug's description, it sounds like the problem is not
> that we go into idle state 0 too often, but that sometimes
> we stay there too long.
>
> I wonder if it makes sense for idle state 0 to promote
> itself to idle state 1 after some amount of time?

That would be equivalent to setting up a timer to fire at the
promotion point, so the data->next_timer_us > 20 check sort of plays
the role of that (we go for polling only when we know there is a timer
to fire soon enough).

> Maybe something like 2x the exit latency (or target
> residency?) of C1?

If the goal is to avoid entering C1 if the exit latency is too high,
we should be checking the exit latency IMO.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-14 15:21                                       ` Rafael J. Wysocki
@ 2016-03-14 17:45                                         ` Rik van Riel
  2016-03-14 22:55                                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-14 17:45 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, Rafael J. Wysocki, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 1436 bytes --]

On Mon, 2016-03-14 at 16:21 +0100, Rafael J. Wysocki wrote:
> On Mon, Mar 14, 2016 at 3:31 PM, Rik van Riel <riel@redhat.com>
> wrote:
> > 
> > Maybe something like 2x the exit latency (or target
> > residency?) of C1?
> 
> If the goal is to avoid entering C1 if the exit latency is too high,
> we should be checking the exit latency IMO.

I agree with that, but doesn't the current code already
do that?

The code currently checks against both the latency_req
and the expected sleep time. What is the rationale for
adding another latency to check against?

I am trying to understand what the goal is here, and
under what circumstances the logic in the main selection
loop needs to be overridden.

What is the main selection loop doing wrong?

When is it causing bad outcomes?

Is it causing other bad outcomes beyond C0/C1 selection?

Can we just ignore polling and always go to HLT, leaving
out this (somewhat kludgy feeling) part of the code?

If HLT is sometimes too slow on Atom, is it also too slow
for certain workloads on faster CPUs? (say 10Gbps ethernet?)

A lot of the logic (especially the load correction) in
menu.c was last fine tuned before several bugs in the
corresponding measuring code were fixed.

I don't understand your patch well enough to object to
it, but the "start the loop at poll instead of HLT"
logic is becoming rather convoluted...

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-14 17:45                                         ` Rik van Riel
@ 2016-03-14 22:55                                           ` Rafael J. Wysocki
  2016-03-15  2:03                                             ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-14 22:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Doug Smythies, Rafael J. Wysocki,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Mon, Mar 14, 2016 at 6:45 PM, Rik van Riel <riel@redhat.com> wrote:
> On Mon, 2016-03-14 at 16:21 +0100, Rafael J. Wysocki wrote:
>> On Mon, Mar 14, 2016 at 3:31 PM, Rik van Riel <riel@redhat.com>
>> wrote:
>> >
>> > Maybe something like 2x the exit latency (or target
>> > residency?) of C1?
>>
>> If the goal is to avoid entering C1 if the exit latency is too high,
>> we should be checking the exit latency IMO.
>
> I agree with that, but doesn't the current code already
> do that?

For reference, here's what the current code does:

        data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
        /*
         * We want to default to C1 (hlt), not to busy polling
         * unless the timer is happening really really soon.
         */
        if (interactivity_req > 20 &&
            !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
            dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
            data->last_state_idx = CPUIDLE_DRIVER_STATE_START;

> The code currently checks against both the latency_req
> and the expected sleep time. What is the rationale for
> adding another latency to check against?

Yes, it does check that in the main selection loop, but the check here
(and in question) is special.

If we end up using last_state_idx == 0 (polling), C1 will still be
considered in the main selection loop and it will be selected if it
satisfies all of the conditions.

If we use last_state_idx == CPUIDLE_DRIVER_STATE_START here, however,
C1 won't be looked at again in the main selection loop, so the check
here is to see whether or not to use C1 even if it *does* *not*
satisfy all of the normal conditions.

Granted, it generally is beneficial to use C1 rather than polling
unless C1 is hopelessly slow and that is visible in the Doug's
results.  Also granted, the way that piece of code worked before was
too aggressive, because it didn't take the "C1 may be hopelessly slow"
part into account at all.

> I am trying to understand what the goal is here, and
> under what circumstances the logic in the main selection
> loop needs to be overridden.
>
> What is the main selection loop doing wrong?
>
> When is it causing bad outcomes?
>
> Is it causing other bad outcomes beyond C0/C1 selection?
>
> Can we just ignore polling and always go to HLT, leaving
> out this (somewhat kludgy feeling) part of the code?

Len tells me that this would have hurt performance.

> If HLT is sometimes too slow on Atom, is it also too slow
> for certain workloads on faster CPUs? (say 10Gbps ethernet?)

Right.

> A lot of the logic (especially the load correction) in
> menu.c was last fine tuned before several bugs in the
> corresponding measuring code were fixed.
>
> I don't understand your patch well enough to object to
> it, but the "start the loop at poll instead of HLT"
> logic is becoming rather convoluted...

That logic has been there for quite a while, though.

Really, the choice is between a simple revert of commit
a9ceb78bc75ca47972096372ff3d48 and doing something in addition to it
to mitigate the "C1 is hopelessly slow" issue.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-14 22:55                                           ` Rafael J. Wysocki
@ 2016-03-15  2:03                                             ` Rik van Riel
  2016-03-16  0:32                                               ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-15  2:03 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, Rafael J. Wysocki, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 1046 bytes --]

On Mon, 2016-03-14 at 23:55 +0100, Rafael J. Wysocki wrote:
> On Mon, Mar 14, 2016 at 6:45 PM, Rik van Riel <riel@redhat.com>
> wrote:
> > 
> > A lot of the logic (especially the load correction) in
> > menu.c was last fine tuned before several bugs in the
> > corresponding measuring code were fixed.
> > 
> > I don't understand your patch well enough to object to
> > it, but the "start the loop at poll instead of HLT"
> > logic is becoming rather convoluted...
> 
> That logic has been there for quite a while, though.
> 
> Really, the choice is between a simple revert of commit
> a9ceb78bc75ca47972096372ff3d48 and doing something in addition to it
> to mitigate the "C1 is hopelessly slow" issue.

That is fair.  I suspect your patch will be better than
a revert on systems where C1 is really slow, and just as
good as a revert on systems where C1 is fast.

We can worry about the choices between deeper C states
in a future version, but for 4.5 your patch would be the
way to go.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-15  2:03                                             ` Rik van Riel
@ 2016-03-16  0:32                                               ` Rafael J. Wysocki
  2016-03-16  0:37                                                 ` Rafael J. Wysocki
  2016-03-16  0:55                                                 ` Rik van Riel
  0 siblings, 2 replies; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-16  0:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Doug Smythies, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 3482 bytes --]

On Monday, March 14, 2016 10:03:15 PM Rik van Riel wrote:
> On Mon, 2016-03-14 at 23:55 +0100, Rafael J. Wysocki wrote:
> > On Mon, Mar 14, 2016 at 6:45 PM, Rik van Riel <riel@redhat.com>
> > wrote:
> > > 
> > > A lot of the logic (especially the load correction) in
> > > menu.c was last fine tuned before several bugs in the
> > > corresponding measuring code were fixed.
> > > 
> > > I don't understand your patch well enough to object to
> > > it, but the "start the loop at poll instead of HLT"
> > > logic is becoming rather convoluted...
> > 
> > That logic has been there for quite a while, though.
> > 
> > Really, the choice is between a simple revert of commit
> > a9ceb78bc75ca47972096372ff3d48 and doing something in addition to it
> > to mitigate the "C1 is hopelessly slow" issue.
> 
> That is fair.  I suspect your patch will be better than
> a revert on systems where C1 is really slow, and just as
> good as a revert on systems where C1 is fast.
> 
> We can worry about the choices between deeper C states
> in a future version, but for 4.5 your patch would be the
> way to go.

OK, so here it goes again with a changelog etc.

---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] cpuidle: menu: 

Commit a9ceb78bc75c (cpuidle,menu: use interactivity_req to disable
polling) modified the condition to decide if C1 should be used as
the fallback state instead of polling to check the value of
interactivity_req rather than the time to the next timer event.
That turned out to cause polling to be used as the fallback state
most of the time, though, so idle state residencies dropped
significantly even on systems with relatively low C1 exit latencies
which resulted in increased energy consumption.

That was not the intention of commit a9ceb78bc75c, so restore the
time to the next timer event check replaced by it, but in order
to mitigate the original problem it attempted to address, add an
extra C1 exit latency check to the condition in question, so C1
is not used as the fallback state if its exit latency is too high.

Fixes: a9ceb78bc75c (cpuidle,menu: use interactivity_req to disable polling)
Reported-and-tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/menu.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -327,11 +327,13 @@ static int menu_select(struct cpuidle_dr
 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
 		/*
 		 * We want to default to C1 (hlt), not to busy polling
-		 * unless the timer is happening really really soon.
+		 * unless the timer is happening really really soon.  Still, if
+		 * the exit latency of C1 is too high, we need to poll anyway.
 		 */
-		if (interactivity_req > 20 &&
+		if (data->next_timer_us > 20 &&
+		    drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency <= 2 &&
 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
-			dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
+		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)
 			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
 	} else {
 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START;

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-16  0:32                                               ` Rafael J. Wysocki
@ 2016-03-16  0:37                                                 ` Rafael J. Wysocki
  2016-03-16  0:55                                                 ` Rik van Riel
  1 sibling, 0 replies; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-16  0:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Doug Smythies, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 3722 bytes --]

On Wednesday, March 16, 2016 01:32:19 AM Rafael J. Wysocki wrote:
> On Monday, March 14, 2016 10:03:15 PM Rik van Riel wrote:
> > On Mon, 2016-03-14 at 23:55 +0100, Rafael J. Wysocki wrote:
> > > On Mon, Mar 14, 2016 at 6:45 PM, Rik van Riel <riel@redhat.com>
> > > wrote:
> > > > 
> > > > A lot of the logic (especially the load correction) in
> > > > menu.c was last fine tuned before several bugs in the
> > > > corresponding measuring code were fixed.
> > > > 
> > > > I don't understand your patch well enough to object to
> > > > it, but the "start the loop at poll instead of HLT"
> > > > logic is becoming rather convoluted...
> > > 
> > > That logic has been there for quite a while, though.
> > > 
> > > Really, the choice is between a simple revert of commit
> > > a9ceb78bc75ca47972096372ff3d48 and doing something in addition to it
> > > to mitigate the "C1 is hopelessly slow" issue.
> > 
> > That is fair.  I suspect your patch will be better than
> > a revert on systems where C1 is really slow, and just as
> > good as a revert on systems where C1 is fast.
> > 
> > We can worry about the choices between deeper C states
> > in a future version, but for 4.5 your patch would be the
> > way to go.
> 
> OK, so here it goes again with a changelog etc.
> 
> ---
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH] cpuidle: menu: 

Ah, missing subject.

What about this: cpuidle: menu: Restore next_timer_us check in menu_select()

> Commit a9ceb78bc75c (cpuidle,menu: use interactivity_req to disable
> polling) modified the condition to decide if C1 should be used as
> the fallback state instead of polling to check the value of
> interactivity_req rather than the time to the next timer event.
> That turned out to cause polling to be used as the fallback state
> most of the time, though, so idle state residencies dropped
> significantly even on systems with relatively low C1 exit latencies
> which resulted in increased energy consumption.
> 
> That was not the intention of commit a9ceb78bc75c, so restore the
> time to the next timer event check replaced by it, but in order
> to mitigate the original problem it attempted to address, add an
> extra C1 exit latency check to the condition in question, so C1
> is not used as the fallback state if its exit latency is too high.
> 
> Fixes: a9ceb78bc75c (cpuidle,menu: use interactivity_req to disable polling)
> Reported-and-tested-by: Doug Smythies <dsmythies@telus.net>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpuidle/governors/menu.c |    8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> Index: linux-pm/drivers/cpuidle/governors/menu.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/governors/menu.c
> +++ linux-pm/drivers/cpuidle/governors/menu.c
> @@ -327,11 +327,13 @@ static int menu_select(struct cpuidle_dr
>  		data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
>  		/*
>  		 * We want to default to C1 (hlt), not to busy polling
> -		 * unless the timer is happening really really soon.
> +		 * unless the timer is happening really really soon.  Still, if
> +		 * the exit latency of C1 is too high, we need to poll anyway.
>  		 */
> -		if (interactivity_req > 20 &&
> +		if (data->next_timer_us > 20 &&
> +		    drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency <= 2 &&
>  		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
> -			dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
> +		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)
>  			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
>  	} else {
>  		data->last_state_idx = CPUIDLE_DRIVER_STATE_START;

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-16  0:32                                               ` Rafael J. Wysocki
  2016-03-16  0:37                                                 ` Rafael J. Wysocki
@ 2016-03-16  0:55                                                 ` Rik van Riel
  2016-03-16  1:53                                                   ` Rafael J. Wysocki
  1 sibling, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-16  0:55 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Doug Smythies, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 1358 bytes --]

On Wed, 2016-03-16 at 01:32 +0100, Rafael J. Wysocki wrote:
> 
>  		 * We want to default to C1 (hlt), not to busy polling
> -		 * unless the timer is happening really really soon.
> +		 * unless the timer is happening really really soon.  Still, if
> +		 * the exit latency of C1 is too high, we need to poll anyway.
>  		 */
> -		if (interactivity_req > 20 &&
> +		if (data->next_timer_us > 20 &&
> +		    drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency <= 2 &&
>  		    !drv-

I wonder if replacing these two tests with

    data->next_timer_us > states[CPUIDLE_DRIVER_STATE_START].exit_latency

would do the right thing?

That way any time we do not know for sure we are waking
up before the HLT exit latency has gone, we will poll.

Any time we suspect we will wake up sooner (data->predicted_us
is smaller than the HLT exit latency) we will still pick HLT,
anyway.

This makes the behaviour consistent between CPUs.
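The comparison suggested above can be sketched outside the kernel like this. This is an illustration only, not the actual menu governor code; the parameter names merely mirror the fields quoted in the patch (`next_timer_us`, C1's `exit_latency`):

```c
#include <stdbool.h>

/*
 * Pick C1 (hlt) only when the next timer is far enough away that we
 * expect to sleep longer than C1's exit latency; otherwise poll.
 * Hypothetical sketch of the proposed condition.
 */
static bool pick_c1(unsigned int next_timer_us, unsigned int c1_exit_latency_us)
{
	return next_timer_us > c1_exit_latency_us;
}
```

With a C1 exit latency of 2 us, a 20 us timer picks C1 and a 1 us timer polls, on every CPU, which is the consistency argued for above.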

> >states[CPUIDLE_DRIVER_STATE_START].disabled &&
> -			dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
> +		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)
>  			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
>  	} else {
>  		data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
-- 
All Rights Reversed.




* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-16  0:55                                                 ` Rik van Riel
@ 2016-03-16  1:53                                                   ` Rafael J. Wysocki
  2016-03-16 13:02                                                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-16  1:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Doug Smythies,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Wed, Mar 16, 2016 at 1:55 AM, Rik van Riel <riel@redhat.com> wrote:
> On Wed, 2016-03-16 at 01:32 +0100, Rafael J. Wysocki wrote:
>>
>>                * We want to default to C1 (hlt), not to busy polling
>> -              * unless the timer is happening really really soon.
>> +              * unless the timer is happening really really soon.  Still, if
>> +              * the exit latency of C1 is too high, we need to poll anyway.
>>                */
>> -             if (interactivity_req > 20 &&
>> +             if (data->next_timer_us > 20 &&
>> +                 drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency <= 2 &&
>>                   !drv-
>
> I wonder if replacing these two tests with
>
>            data->next_timer_us > states[CPUIDLE_DRIVER_STATE_START].exit_latency
>
> would do the right thing?
>
> That way any time we do not know for sure we are waking
> up before the HLT exit latency has gone, we will poll.
>
> Any time we suspect we will wake up sooner (data->predicted_us
> is smaller than the HLT exit latency) we will still pick HLT,
> anyway.
>
> This makes the behaviour consistent between CPUs.

The idea makes sense to me, but I'd test against the sum of C1 exit
latency and C1 target residency, ie

data->next_timer_us > states[CPUIDLE_DRIVER_STATE_START].exit_latency
+ states[CPUIDLE_DRIVER_STATE_START].target_residency

because that's the time that has to pass before the next timer event
for it to make sense to go to C1 from the energy efficiency
perspective (if we are woken up earlier, we'll burn more energy than
we would if we didn't enter C1 at all, at least in theory).
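That break-even condition can be written as a small sketch. Again this is illustration, not kernel code; the field names follow struct cpuidle_state (`exit_latency`, `target_residency`), and the threshold values in the comment are examples:

```c
#include <stdbool.h>

/*
 * Entering C1 only pays off, energy-wise, if the CPU expects to stay
 * idle at least as long as C1's exit latency plus its target residency.
 * Illustrative sketch of the break-even test described above.
 */
static bool c1_saves_energy(unsigned int next_timer_us,
			    unsigned int exit_latency_us,
			    unsigned int target_residency_us)
{
	return next_timer_us > exit_latency_us + target_residency_us;
}
```

For example, with exit latency 2 us and target residency 2 us, a sleep has to be expected to last more than 4 us before C1 is worth entering at all.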


* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-16  1:53                                                   ` Rafael J. Wysocki
@ 2016-03-16 13:02                                                     ` Rafael J. Wysocki
  2016-03-16 14:01                                                       ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-16 13:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rik van Riel, Rafael J. Wysocki, Doug Smythies, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

On Wed, Mar 16, 2016 at 2:53 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Mar 16, 2016 at 1:55 AM, Rik van Riel <riel@redhat.com> wrote:
>> On Wed, 2016-03-16 at 01:32 +0100, Rafael J. Wysocki wrote:
>>>
>>>                * We want to default to C1 (hlt), not to busy polling
>>> -              * unless the timer is happening really really soon.
>>> +              * unless the timer is happening really really soon.  Still, if
>>> +              * the exit latency of C1 is too high, we need to poll anyway.
>>>                */
>>> -             if (interactivity_req > 20 &&
>>> +             if (data->next_timer_us > 20 &&
>>> +                 drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency <= 2 &&
>>>                   !drv-
>>
>> I wonder if replacing these two tests with
>>
>>            data->next_timer_us > states[CPUIDLE_DRIVER_STATE_START].exit_latency
>>
>> would do the right thing?
>>
>> That way any time we do not know for sure we are waking
>> up before the HLT exit latency has gone, we will poll.
>>
>> Any time we suspect we will wake up sooner (data->predicted_us
>> is smaller than the HLT exit latency) we will still pick HLT,
>> anyway.
>>
>> This makes the behaviour consistent between CPUs.
>
> The idea makes sense to me, but I'd test against the sum of C1 exit
> latency and C1 target residency, ie
>
> data->next_timer_us > states[CPUIDLE_DRIVER_STATE_START].exit_latency
> + states[CPUIDLE_DRIVER_STATE_START].target_residency
>
> because that's the time that has to pass before the next timer event
> for it to make sense to go to C1 from the energy efficiency
> perspective (if we are woken up earlier, we'll burn more energy than
> we would if we didn't enter C1 at all, at least in theory).

My concern here is that this might defeat the original purpose of
commit a9ceb78bc75c, though, because it would make us use C1 more
aggressively in general on everything where the sum of C1 exit latency
and C1 target residency is less than 20.


* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-16 13:02                                                     ` Rafael J. Wysocki
@ 2016-03-16 14:01                                                       ` Rik van Riel
  2016-03-16 14:14                                                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-16 14:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Doug Smythies, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 856 bytes --]

On Wed, 16 Mar 2016 14:02:33 +0100
"Rafael J. Wysocki" <rafael@kernel.org> wrote:

> My concern here is that this might defeat the original purpose of
> commit a9ceb78bc75c, though, because it would make us use C1 more
> aggressively in general on everything where the sum of C1 exit latency
> and C1 target residency is less than 20.

I just got this email from the test robot this morning.

Apparently the test patch I sent to Arto results in a 6.5%
improvement on blogbench, so there is a performance benefit
to polling when we expect short sleeps (not just a short
timer timeout, but corrected for typical interval, etc).

On the other hand, being more aggressive about C1 allows
other cores to run at higher frequencies...

The patch:

https://github.com/0day-ci/linux/commit/efd3a8716c71cfde82e5329216077f3e1530b66f

The kernel test robot results:



[-- Attachment #2: testbot-cpuidle --]
[-- Type: application/octet-stream, Size: 22220 bytes --]

Date: Wed, 16 Mar 2016 14:52:33 +0800
From: kernel test robot <xiaolong.ye@intel.com>
Sender: lkp-request@eclists.intel.com
To: Rik van Riel <riel@redhat.com>
Cc: 0day robot <fengguang.wu@intel.com>, LKML <linux-kernel@vger.kernel.org>,
        lkp@01.org
Subject: [lkp] [SKL BOOT FAILURE unless idle=nomwait (was Re] 256126a34c:
 +6.5% blogbench.write_score
Message-ID: <20160316065233.GC28484@yexl-desktop>
Reply-To: kernel test robot <xiaolong.ye@intel.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="LZvS9be/3tNcYl/X"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
User-Agent: Heirloom mailx 12.5 6/20/10


--LZvS9be/3tNcYl/X
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

FYI, we noticed that blogbench.write_score +6.5% improvement on

https://github.com/0day-ci/linux Rik-van-Riel/cpuidle-use-predicted_us-not-interactivity_req-to-consider-polling/20160312-043358
commit 256126a34c2a087a2a9900c2c7ac9a7c4f5e54e4 ("SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)")


=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/rootfs/tbox_group/testcase:
  gcc-4.9/performance/1SSD/xfs/x86_64-rhel/debian-x86_64-2015-02-07.cgz/lkp-bdw01/blogbench

commit: 
  e6a4261e5a9e3ff89a4667afd6a314352654e2ef
  256126a34c2a087a2a9900c2c7ac9a7c4f5e54e4

e6a4261e5a9e3ff8 256126a34c2a087a2a9900c2c7 
---------------- -------------------------- 
         %stddev     %change         %stddev
             \          |                \  
  53652546 ±  1%      +5.6%   56637238 ±  1%  blogbench.time.file_system_outputs
    632330 ±  1%      -5.5%     597814 ±  1%  blogbench.time.involuntary_context_switches
      4861 ± 55%     -40.8%       2880 ±  2%  blogbench.time.maximum_resident_set_size
    206.67 ±  0%      -4.4%     197.50 ±  0%  blogbench.time.percent_of_cpu_this_job_got
    584.62 ±  0%      -4.6%     557.91 ±  0%  blogbench.time.system_time
   4176142 ±  0%      -1.7%    4105578 ±  0%  blogbench.time.voluntary_context_switches
      7471 ±  1%      +6.5%       7955 ±  1%  blogbench.write_score
      2646 ±  8%     +36.9%       3623 ± 14%  meminfo.Writeback
    107366 ±  1%      +5.1%     112827 ±  0%  vmstat.io.bo
     28583 ±  0%      +0.9%      28837 ±  1%  vmstat.system.cs
     85.70 ±  0%     -31.3%      58.88 ±  0%  turbostat.%Busy
      1881 ±  0%     -31.3%       1292 ±  0%  turbostat.Avg_MHz
     13.83 ±  0%    +193.9%      40.66 ±  0%  turbostat.CPU%c1
      6.24 ±  0%     -13.4%       5.40 ±  0%  turbostat.CorWatt
      8.11 ±  0%      -7.7%       7.48 ±  0%  turbostat.PkgWatt
 1.392e+08 ±  0%    +233.4%  4.639e+08 ±  0%  cpuidle.C1-BDW.time
    448770 ±  2%    +494.6%    2668387 ±  1%  cpuidle.C1-BDW.usage
    250762 ± 12%     -22.2%     195098 ± 15%  cpuidle.C6-BDW.time
   1002396 ± 24%     -37.6%     625501 ± 14%  cpuidle.C7s-BDW.time
 3.056e+08 ±  2%     -99.9%     220778 ± 33%  cpuidle.POLL.time
   2048094 ±  1%     -98.9%      22873 ±  8%  cpuidle.POLL.usage
     90.78 ±  8%     -23.4%      69.52 ± 20%  sched_debug.cfs_rq:/.util_avg.stddev
     -9028 ± -1%     -29.9%      -6332 ± -6%  sched_debug.cpu.nr_uninterruptible.0
      3003 ±  9%     -35.5%       1938 ± 18%  sched_debug.cpu.nr_uninterruptible.1
      4372 ±  6%     -24.9%       3281 ±  7%  sched_debug.cpu.nr_uninterruptible.2
      1725 ± 14%     -31.4%       1183 ± 17%  sched_debug.cpu.nr_uninterruptible.3
      4393 ±  6%     -24.4%       3319 ±  7%  sched_debug.cpu.nr_uninterruptible.max
     -9031 ± -1%     -29.8%      -6335 ± -6%  sched_debug.cpu.nr_uninterruptible.min
      5320 ±  1%     -29.2%       3766 ±  6%  sched_debug.cpu.nr_uninterruptible.stddev
    142568 ±  9%     +23.0%     175340 ± 13%  proc-vmstat.kswapd_high_wmark_hit_quickly
    184381 ±  7%     +19.0%     219502 ± 11%  proc-vmstat.kswapd_low_wmark_hit_quickly
    288558 ±  3%     -15.7%     243178 ±  9%  proc-vmstat.nr_vmscan_immediate_reclaim
    764.33 ±  8%     +27.9%     977.25 ± 10%  proc-vmstat.nr_writeback
    330124 ±  8%     +20.7%     398369 ± 12%  proc-vmstat.pageoutrun
    244293 ±  2%     -15.9%     205562 ±  6%  proc-vmstat.pgrotated
   1777197 ±  3%      +9.6%    1948586 ±  7%  proc-vmstat.pgscan_direct_dma32
    751311 ±  3%      +9.8%     824826 ±  7%  proc-vmstat.pgscan_direct_normal
   1699363 ±  3%     +10.2%    1872550 ±  6%  proc-vmstat.pgsteal_direct_dma32
    714246 ±  3%     +10.5%     788950 ±  7%  proc-vmstat.pgsteal_direct_normal
    348656 ±  3%      +4.4%     364123 ±  5%  proc-vmstat.workingset_nodereclaim

lkp-bdw01: Broadwell ULT
Memory: 4G



                                   cpuidle.POLL.time

  3.5e+08 ++----------------------------------------------------------------+
          |    .* .**                   *. **.       * .**.                 |
    3e+08 ++ **  *   *.**.**.**.**   *.*  *   *   *.* *    **.**.***.**.*  **
          |  :                   :   :        :   :                     :  :|
  2.5e+08 ++ :                   :   :        :   :                     :  :|
          |  :                   :   :        :   :                     :  :|
    2e+08 ++:                    :   :        :   :                     : : |
          | :                    :   :        :   :                     : : |
  1.5e+08 ++:                     :  :         :  :                      :: |
          | :                     : :          : :                       :: |
    1e+08 ++:                     : :          : :                       :: |
          | :                     : :          : :                       :: |
    5e+07 ++                      : :          : :                       :  |
          |:                      : :          : :                       :  |
        0 OO-OO-OO-OOO-OO-OO-OO-OOO-OO-O-------*-*-----------------------*--+


                                  cpuidle.POLL.usage

  2.5e+06 ++----------------------------------------------------------------+
          |                                                                 |
          |   *.**.*  .*         *    .**.**       .***.*    .*   * .**.*  *|
    2e+06 ++ *      **  *.**.**.*:   *      *.*   *      *.**  *.* *    :  :*
          |  :                   :   :        :   :                     :  :|
          |  :                   :   :        :   :                     :  :|
  1.5e+06 ++:                    :   :        :   :                     : : |
          | :                    :   :        :   :                     : : |
    1e+06 ++:                     :  :        :   :                      :: |
          | :                     : :          : :                       :: |
          | :                     : :          : :                       :: |
   500000 ++:                     : :          : :                       :: |
          |:                      : :          : :                       :  |
          |:                      : :          : :                       :  |
        0 OO-OO-OO-OOO-OO-OO-OO-OOO-OO-O-------*-*-----------------------*--+


                                  cpuidle.C1-BDW.time

    5e+08 ++----------------------------------------------------------------+
  4.5e+08 OO OO OO OOO OO OO OO OOO OO O                                    |
          |                                                                 |
    4e+08 ++                                                                |
  3.5e+08 ++                                                                |
          |                                                                 |
    3e+08 ++                                                                |
  2.5e+08 ++                                                                |
    2e+08 ++                                                                |
          |                                                                 |
  1.5e+08 ++ **.**.***.**.**.**.**   *.**.***.*   *.***.**.**.**.***.**.*  **
    1e+08 ++ :                   :   :        :   :                     :  :|
          | :                     :  :         :  :                      :: |
    5e+07 ++:                     : :          : :                       :: |
        0 **----------------------*-*----------*-*-----------------------*--+


                                 cpuidle.C1-BDW.usage

    3e+06 ++----------------------------------------------------------------+
          |        O O O  O     OO   O O                                    |
  2.5e+06 OO OO OO  O   O  O OO   O O                                       |
          |                                                                 |
          |                                                                 |
    2e+06 ++                                                                |
          |                                                                 |
  1.5e+06 ++                                                                |
          |                                                                 |
    1e+06 ++                                                                |
          |                                                                 |
          |                                                                 |
   500000 ++ **.**.***.**.**.**.**   *.**.***.*   *.***.**.**.**.***.**.*  **
          | +                     :  :         :  :                      :+ |
        0 **----------------------*-*----------*-*-----------------------*--+


                                 turbostat.Avg_MHz

  2000 ++-------------------------------------------------------------------+
  1800 ++ *.**.**.**.**.**.**.**    **.**.**.*   *.**.**.**.**.**.**.**.*  **
       |  :                    :    :        :   :                      :  :|
  1600 ++ :                    :    :        :   :                      :  :|
  1400 ++ :                    :    :        :   :                      :  :|
       O OO OO OO OO OO OO OO OO:OO:OO       :   :                      : : |
  1200 ++ :                     :  :         :   :                      : : |
  1000 ++ :                     :  :         :   :                      : : |
   800 ++:                      :  :          : :                        :: |
       | :                      :  :          : :                        :: |
   600 ++:                      :  :          : :                        :: |
   400 ++:                       ::           : :                        :  |
       | :                       ::           : :                        :  |
   200 ++:                       ::           : :                        :  |
     0 *+*-----------------------**-----------*-*------------------------*--+


                                 turbostat._Busy

  90 ++---------------------------------------------------------------------+
     |  *.**.**.**.**.*.**.**.*   *.**.*.**.*   *.**.**.**.*.**.**.**.**  *.*
  80 ++ :                     :   :         :   :                      :  : |
  70 ++ :                     :   :         :   :                      :  : |
     |  :                     :   :         :   :                      :  : |
  60 O+OO OO OO OO OO O OO OO OO OO O       :   :                       : : |
  50 ++ :                     :   :         :   :                       : : |
     |  :                     :   :         :   :                       : : |
  40 ++:                       : :           : :                        ::  |
  30 ++:                       : :           : :                        ::  |
     | :                       : :           : :                        ::  |
  20 ++:                       : :           : :                         :  |
  10 ++:                       : :           : :                         :  |
     | :                       : :           : :                         :  |
   0 *+*-----------------------*-*-----------*-*-------------------------*--+


                                turbostat.CPU_c1

  45 ++---------------------------------------------------------------------+
     O OO  O OO  O O  O  O OO OO OO O                                       |
  40 ++   O     O   O   O                                                   |
  35 ++                                                                     |
     |                                                                      |
  30 ++                                                                     |
  25 ++                                                                     |
     |                                                                      |
  20 ++                                                                     |
  15 ++                                                   .*.*              |
     |  *.**.**.**.**.*.**.**.*   *.**.*.**.*   *.**.**.**    *.**.**.**  *.*
  10 ++ :                     :   :         :   :                       : : |
   5 ++ :                      :  :          :  :                       : : |
     | :                       : :           : :                         :  |
   0 *+*-----------------------*-*-----------*-*-------------------------*--+


                                turbostat.PkgWatt

  9 ++----------------------------------------------------------------------+
    |  *.**.**.*.**.**.**.**.*    **.**.*.**    **.*.**.**.**.**.*.**.**  *.*
  8 O+OO OO OO O OO OO OO OO O OO OO       :    :                      :  : |
  7 ++ :                     :    :        :    :                      :  : |
    |  :                     :    :        :    :                      :  : |
  6 ++ :                      :  :          :  :                        : : |
  5 ++ :                      :  :          :  :                        : : |
    |  :                      :  :          :  :                        : : |
  4 ++:                       :  :          :  :                        ::  |
  3 ++:                       :  :          :  :                        ::  |
    | :                       :  :          :  :                        ::  |
  2 ++:                        ::            ::                          :  |
  1 ++:                        ::            ::                          :  |
    | :                        ::            ::                          :  |
  0 *+*------------------------**------------**--------------------------*--+


                                turbostat.CorWatt

  7 ++----------------------------------------------------------------------+
    |                *.                                        *.          .*
  6 ++ *.**.**.*.**.*  **.**.*    **.**.*.**    **.*.**.**.**.*  *.**.**  * |
    O OO OO OO O OO OO OO OO O OO OO       :    :                      :  : |
  5 ++ :                     :    :        :    :                      :  : |
    |  :                      :   :         :   :                       : : |
  4 ++ :                      :  :          :  :                        : : |
    |  :                      :  :          :  :                        : : |
  3 ++ :                      :  :          :  :                        : : |
    | :                       :  :          :  :                        ::  |
  2 ++:                       :  :          :  :                        ::  |
    | :                        : :           : :                         :  |
  1 ++:                        ::            ::                          :  |
    | :                        ::            ::                          :  |
  0 *+*------------------------**------------**--------------------------*--+


	[*] bisect-good sample
	[O] bisect-bad  sample

To reproduce:

        git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
        cd lkp-tests
        bin/lkp install job.yaml  # job file is attached in this email
        bin/lkp run     job.yaml


Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Xiaolong Ye.

--LZvS9be/3tNcYl/X
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="job.yaml"

---
LKP_SERVER: inn
LKP_CGI_PORT: 80
LKP_CIFS_PORT: 139
testcase: blogbench
default-monitors:
  wait: activate-monitor
  kmsg: 
  uptime: 
  iostat: 
  heartbeat: 
  vmstat: 
  numa-numastat: 
  numa-vmstat: 
  numa-meminfo: 
  proc-vmstat: 
  proc-stat:
    interval: 10
  meminfo: 
  slabinfo: 
  interrupts: 
  lock_stat: 
  latency_stats: 
  softirqs: 
  bdi_dev_mapping: 
  diskstats: 
  nfsstat: 
  cpuidle: 
  cpufreq-stats: 
  turbostat: 
  pmeter: 
  sched_debug:
    interval: 60
cpufreq_governor: performance
default-watchdogs:
  oom-killer: 
  watchdog: 
commit: 256126a34c2a087a2a9900c2c7ac9a7c4f5e54e4
model: Broadwell ULT
nr_cpu: 4
memory: 4G
nr_ssd_partitions: 1
ssd_partitions: "/dev/disk/by-id/ata-INTEL_SSDSC2CW120A3_CVCV252505YF120BGN-part1"
rootfs_partition: "/dev/disk/by-id/ata-INTEL_SSDSC2CW120A3_CVCV252505YF120BGN-part2"
category: benchmark
disk: 1SSD
fs: xfs
blogbench: 
queue: bisect
testbox: lkp-bdw01
tbox_group: lkp-bdw01
kconfig: x86_64-rhel
enqueue_time: 2016-03-14 10:05:06.928750572 +08:00
compiler: gcc-4.9
rootfs: debian-x86_64-2015-02-07.cgz
id: fb599b2e44822b788d923a0908824e7dfe82e865
user: lkp
head_commit: e1eef6e9c27d24f398dc71e4fb13ebb6e2a13f60
base_commit: f6cede5b49e822ebc41a099fe41ab4989f64e2cb
branch: linux-devel/devel-hourly-2016031318
result_root: "/result/blogbench/performance-1SSD-xfs/lkp-bdw01/debian-x86_64-2015-02-07.cgz/x86_64-rhel/gcc-4.9/256126a34c2a087a2a9900c2c7ac9a7c4f5e54e4/0"
job_file: "/lkp/scheduled/lkp-bdw01/bisect_blogbench-performance-1SSD-xfs-debian-x86_64-2015-02-07.cgz-x86_64-rhel-256126a34c2a087a2a9900c2c7ac9a7c4f5e54e4-20160314-20015-1bwze2e-0.yaml"
max_uptime: 2613.3599999999997
initrd: "/osimage/debian/debian-x86_64-2015-02-07.cgz"
bootloader_append:
- root=/dev/ram0
- user=lkp
- job=/lkp/scheduled/lkp-bdw01/bisect_blogbench-performance-1SSD-xfs-debian-x86_64-2015-02-07.cgz-x86_64-rhel-256126a34c2a087a2a9900c2c7ac9a7c4f5e54e4-20160314-20015-1bwze2e-0.yaml
- ARCH=x86_64
- kconfig=x86_64-rhel
- branch=linux-devel/devel-hourly-2016031318
- commit=256126a34c2a087a2a9900c2c7ac9a7c4f5e54e4
- BOOT_IMAGE=/pkg/linux/x86_64-rhel/gcc-4.9/256126a34c2a087a2a9900c2c7ac9a7c4f5e54e4/vmlinuz-4.5.0-rc7-00188-g256126a
- max_uptime=2613
- RESULT_ROOT=/result/blogbench/performance-1SSD-xfs/lkp-bdw01/debian-x86_64-2015-02-07.cgz/x86_64-rhel/gcc-4.9/256126a34c2a087a2a9900c2c7ac9a7c4f5e54e4/0
- LKP_SERVER=inn
- |2-


  earlyprintk=ttyS0,115200 systemd.log_level=err
  debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100
  panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0
  console=ttyS0,115200 console=tty0 vga=normal

  rw
lkp_initrd: "/lkp/lkp/lkp-x86_64.cgz"
modules_initrd: "/pkg/linux/x86_64-rhel/gcc-4.9/256126a34c2a087a2a9900c2c7ac9a7c4f5e54e4/modules.cgz"
bm_initrd: "/osimage/deps/debian-x86_64-2015-02-07.cgz/lkp.cgz,/osimage/deps/debian-x86_64-2015-02-07.cgz/run-ipconfig.cgz,/osimage/deps/debian-x86_64-2015-02-07.cgz/turbostat.cgz,/lkp/benchmarks/turbostat.cgz,/osimage/deps/debian-x86_64-2015-02-07.cgz/fs.cgz,/lkp/benchmarks/blogbench.cgz"
linux_headers_initrd: "/pkg/linux/x86_64-rhel/gcc-4.9/256126a34c2a087a2a9900c2c7ac9a7c4f5e54e4/linux-headers.cgz"
repeat_to: 2
kernel: "/pkg/linux/x86_64-rhel/gcc-4.9/256126a34c2a087a2a9900c2c7ac9a7c4f5e54e4/vmlinuz-4.5.0-rc7-00188-g256126a"
dequeue_time: 2016-03-14 10:10:24.587691856 +08:00
job_state: finished
loadavg: 101.27 65.81 28.50 1/213 4731
start_time: '5752888769'
end_time: '5752889070'
version: "/lkp/lkp/.src-20160311-183612"

--LZvS9be/3tNcYl/X
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=reproduce

2152-04-20 16:39:14 echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
2152-04-20 16:39:14 echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
2152-04-20 16:39:14 echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
2152-04-20 16:39:14 echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
2152-04-20 16:39:14 mkfs -t xfs /dev/sda1
2152-04-20 16:39:28 mount -t xfs -o nobarrier,inode64 /dev/sda1 /fs/sda1
2152-04-20 16:39:29 ./blogbench -d /fs/sda1

--LZvS9be/3tNcYl/X--


* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-16 14:01                                                       ` Rik van Riel
@ 2016-03-16 14:14                                                         ` Rafael J. Wysocki
  2016-03-16 14:46                                                           ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-16 14:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Doug Smythies,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Wed, Mar 16, 2016 at 3:01 PM, Rik van Riel <riel@redhat.com> wrote:
> On Wed, 16 Mar 2016 14:02:33 +0100
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
>> My concern here is that this might defeat the original purpose of
>> commit a9ceb78bc75c, though, because it would make us use C1 more
>> aggressively in general on everything where the sum of C1 exit latency
>> and C1 target residency is less than 20.
>
> I just got this email from the test robot this morning.
>
> Apparently the test patch I sent to Arto results in a 6.5%
> improvement on blogbench, so there is a performance benefit
> to polling when we expect short sleeps (not just a short
> timer timeout, but corrected for typical interval, etc).

That's not unexpected. :-)

However, question is whether or not that offsets the increased energy
consumption.  The entire point of using C1 as the fallback state
instead of polling is the observation that polling is too costly
energetically overall to be used as the fallback state
unconditionally.

> On the other hand, being more aggressive about C1 allows
> other cores to run at higher frequencies...

Good point.

We need to decide, though.

Do we generally want to use more polling or more C1?


* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-16 14:14                                                         ` Rafael J. Wysocki
@ 2016-03-16 14:46                                                           ` Rik van Riel
  2016-03-16 15:05                                                             ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-16 14:46 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Doug Smythies, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 1758 bytes --]

On Wed, 2016-03-16 at 15:14 +0100, Rafael J. Wysocki wrote:
> On Wed, Mar 16, 2016 at 3:01 PM, Rik van Riel <riel@redhat.com>
> wrote:
> > 
> > On the other hand, being more aggressive about C1 allows
> > other cores to run at higher frequencies...
> Good point.
> 
> We need to decide, though.
> 
> Do we generally want to use more polling or more C1?

I can see five distinct data points for decision making:

1) data->next_timer_us where we know for sure the sleep
   interval will be less than C1 latency -> poll

2) latency_req as configured by the user, if this is
   smaller than C1 latency we need poll (we currently
   get this wrong)

3) interval determined by get_typical_interval, this we
   may know with a higher confidence value, and choosing
   poll may actually be appropriate

4) data->predicted_us corrected only by bucket correction
   factor - this is a gamble, so we probably want C1,
   even if data->predicted_us is low

5) interactivity_req, which is data->predicted_us
   divided by a load factor - we should probably ignore
   that for poll since it is way too aggressive, and
   stick to C1 and deeper for this factor

Does the above make sense?

Currently the code merges (3) and (4) into the same
value before making a comparison, and neglects to take
(2) into account at all when deciding whether to poll.

Would you like me to write a patch to take the user
configured latency_req into account, separate out
the predicted_us and typical_interval, and test whether
to poll against the smaller of next_timer_us, latency_req,
and the typical_interval, while we continue to use the
predicted_us as is in the main selection loop?
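
If it helps to be concrete, here is a standalone sketch of that poll-vs-C1
test. The struct, names, and the 2 us C1 numbers are made up for
illustration; they are not the real cpuidle driver table or the final
patch:

```c
#include <assert.h>
#include <stddef.h>

/* Toy C1 description: 2 us exit latency, 2 us target residency
 * (illustrative values, not taken from any real driver). */
struct cstate {
	unsigned int exit_latency_us;
	unsigned int target_residency_us;
};

static unsigned int min3(unsigned int a, unsigned int b, unsigned int c)
{
	unsigned int m = a < b ? a : b;
	return m < c ? m : c;
}

/* Return 1 to busy-poll, 0 to enter C1.  Only the three
 * high-confidence inputs (next timer, user latency limit,
 * typical interval) are consulted for the poll decision. */
static int should_poll(unsigned int next_timer_us,
		       unsigned int latency_req_us,
		       unsigned int typical_interval_us,
		       const struct cstate *c1)
{
	unsigned int expected = min3(next_timer_us, latency_req_us,
				     typical_interval_us);

	if (latency_req_us <= c1->exit_latency_us)
		return 1;	/* user limit rules out even C1 */
	if (expected <= c1->target_residency_us)
		return 1;	/* sleep too short to amortize C1 */
	return 0;
}
```

Any one of the three inputs being below the C1 numbers is enough to pick
polling; predicted_us stays out of this and only drives the main loop.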

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-16 14:46                                                           ` Rik van Riel
@ 2016-03-16 15:05                                                             ` Rafael J. Wysocki
  2016-03-16 15:07                                                               ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-16 15:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Doug Smythies,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Wed, Mar 16, 2016 at 3:46 PM, Rik van Riel <riel@redhat.com> wrote:
> On Wed, 2016-03-16 at 15:14 +0100, Rafael J. Wysocki wrote:
>> On Wed, Mar 16, 2016 at 3:01 PM, Rik van Riel <riel@redhat.com>
>> wrote:
>> >
>> > On the other hand, being more aggressive about C1 allows
>> > other cores to run at higher frequencies...
>> Good point.
>>
>> We need to decide, though.
>>
>> Do we generally want to use more polling or more C1?
>
> I can see five distinct data points for decision making:
>
> 1) data->next_timer_us where we know for sure the sleep
>    interval will be less than C1 latency -> poll

Almost right, modulo my previous comment: that should be the sum of C1
exit latency and C1 target residency.

> 2) latency_req as configured by the user, if this is
>    smaller than C1 latency we need poll (we currently
>    get this wrong)

Correct.

> 3) interval determined by get_typical_interval, this we
>    may know with a higher confidence value, and choosing
>    poll may actually be appropriate

Right.

> 4) data->predicted_us corrected only by bucket correction
>    factor - this is a gamble, so we probably want C1,
>    even if data->predicted_us is low

Agreed.

> 5) interactivity_req, which is data->predicted_us
>    divided by a load factor - we should probably ignore
>    that for poll since it is way too aggressive, and
>    stick to C1 and deeper for this factor

Agreed.

> Does the above make sense?

Yes, it does.

> Currently the code merges (3) and (4) into the same
> value before making a comparison, and neglects to take
> (2) into account at all when deciding whether to poll.
>
> Would you like me to write a patch to take the user
> configured latency_req into account, separate out
> the predicted_us and typical_interval, and test whether
> to poll against the smaller of next_timer_us, latency_req,
> and the typical_interval, while we continue to use the
> predicted_us as is in the main selection loop?

Yes, please.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-16 15:05                                                             ` Rafael J. Wysocki
@ 2016-03-16 15:07                                                               ` Rik van Riel
  2016-03-16 15:09                                                                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-16 15:07 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Doug Smythies, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 1428 bytes --]

On Wed, 2016-03-16 at 16:05 +0100, Rafael J. Wysocki wrote:
> On Wed, Mar 16, 2016 at 3:46 PM, Rik van Riel <riel@redhat.com>
> wrote:
> > 
> > On Wed, 2016-03-16 at 15:14 +0100, Rafael J. Wysocki wrote:
> > > 
> > > On Wed, Mar 16, 2016 at 3:01 PM, Rik van Riel <riel@redhat.com>
> > > wrote:
> > > > 
> > > > 
> > > > On the other hand, being more aggressive about C1 allows
> > > > other cores to run at higher frequencies...
> > > Good point.
> > > 
> > > We need to decide, though.
> > > 
> > > Do we generally want to use more polling or more C1?
> > I can see five distinct data points for decision making:
> > 
> > 1) data->next_timer_us where we know for sure the sleep
> >    interval will be less than C1 latency -> poll
> Almost right, modulo my previous comment: that should be the sum of
> C1
> exit latency and C1 target residency.

Doesn't the residency include the exit latency?

After all, we subtract the exit latency from the
measured residency time in menu_update.
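
A toy version of that correction, with made-up numbers (the real code in
menu_update uses the driver's per-state exit_latency):

```c
#include <assert.h>

/* Toy model of the menu_update correction Rik refers to: the time
 * usefully spent in the idle state is the measured interval minus
 * the state's exit latency, so the raw measurement already
 * contains the exit latency. */
static unsigned int corrected_residency_us(unsigned int measured_us,
					   unsigned int exit_latency_us)
{
	if (measured_us > exit_latency_us)
		return measured_us - exit_latency_us;
	return 0;	/* woke before the state was fully entered */
}
```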

> > Would you like me to write a patch to take the user
> > configured latency_req into account, separate out
> > the predicted_us and typical_interval, and test whether
> > to poll against the smaller of next_timer_us, latency_req,
> > and the typical_interval, while we continue to use the
> > predicted_us as is in the main selection loop?
> Yes, please.

Will do.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4)
  2016-03-16 15:07                                                               ` Rik van Riel
@ 2016-03-16 15:09                                                                 ` Rafael J. Wysocki
  2016-03-16 16:14                                                                   ` [PATCH] cpuidle: use high confidence factors only when considering polling Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-16 15:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Doug Smythies,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Wed, Mar 16, 2016 at 4:07 PM, Rik van Riel <riel@redhat.com> wrote:
> On Wed, 2016-03-16 at 16:05 +0100, Rafael J. Wysocki wrote:
>> On Wed, Mar 16, 2016 at 3:46 PM, Rik van Riel <riel@redhat.com>
>> wrote:
>> >
>> > On Wed, 2016-03-16 at 15:14 +0100, Rafael J. Wysocki wrote:
>> > >
>> > > On Wed, Mar 16, 2016 at 3:01 PM, Rik van Riel <riel@redhat.com>
>> > > wrote:
>> > > >
>> > > >
>> > > > On the other hand, being more aggressive about C1 allows
>> > > > other cores to run at higher frequencies...
>> > > Good point.
>> > >
>> > > We need to decide, though.
>> > >
>> > > Do we generally want to use more polling or more C1?
>> > I can see five distinct data points for decision making:
>> >
>> > 1) data->next_timer_us where we know for sure the sleep
>> >    interval will be less than C1 latency -> poll
>> Almost right, modulo my previous comment: that should be the sum of
>> C1
>> exit latency and C1 target residency.
>
> Doesn't the residency include the exit latency?
>
> After all, we subtract the exit latency from the
> measured residency time in menu_update.

Yes, it does, sorry.  So it should be the target residency. :-)

>> > Would you like me to write a patch to take the user
>> > configured latency_req into account, separate out
>> > the predicted_us and typical_interval, and test whether
>> > to poll against the smaller of next_timer_us, latency_req,
>> > and the typical_interval, while we continue to use the
>> > predicted_us as is in the main selection loop?
>> Yes, please.
>
> Will do.

Cool, thanks!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-16 15:09                                                                 ` Rafael J. Wysocki
@ 2016-03-16 16:14                                                                   ` Rik van Riel
  2016-03-18  0:45                                                                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-16 16:14 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Doug Smythies, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

The menu governor uses five different factors to pick the
idle state:
- the user configured latency_req
- the time until the next timer (next_timer_us)
- the typical sleep interval, as measured recently
- an estimate of sleep time by dividing next_timer_us by an observed factor
- a load corrected version of the above, divided again by load

Only the first three items are known with enough confidence that
we can use them to consider polling, instead of an actual CPU
idle state, because the cost of being wrong about polling can be
excessive power use.

The latter two are used in the menu governor's main selection
loop, and can result in choosing a shallower idle state when
the system is expected to be busy again soon.

This pushes a busy system in the "performance" direction of
the performance<>power tradeoff, when choosing between idle
states, but stays more strictly on the "power" side when
deciding between polling and C1.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 drivers/cpuidle/governors/menu.c | 42 +++++++++++++++++++++++-----------------
 1 file changed, 24 insertions(+), 18 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 0742b3296673..fba867d917f7 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -196,7 +196,7 @@ static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev);
  * of points is below a threshold. If it is... then use the
  * average of these 8 points as the estimated value.
  */
-static void get_typical_interval(struct menu_device *data)
+static unsigned int get_typical_interval(struct menu_device *data)
 {
 	int i, divisor;
 	unsigned int max, thresh;
@@ -254,9 +254,7 @@ static void get_typical_interval(struct menu_device *data)
 		stddev = int_sqrt(stddev);
 		if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
 							|| stddev <= 20) {
-			if (data->next_timer_us > avg)
-				data->predicted_us = avg;
-			return;
+			return avg;
 		}
 	}
 
@@ -270,7 +268,7 @@ static void get_typical_interval(struct menu_device *data)
 	 * with sporadic activity with a bunch of short pauses.
 	 */
 	if ((divisor * 4) <= INTERVALS * 3)
-		return;
+		return UINT_MAX;
 
 	thresh = max - 1;
 	goto again;
@@ -287,6 +285,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
 	int i;
 	unsigned int interactivity_req;
+	unsigned int expected_interval;
 	unsigned long nr_iowaiters, cpu_load;
 
 	if (data->needs_update) {
@@ -313,32 +312,39 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 					 data->correction_factor[data->bucket],
 					 RESOLUTION * DECAY);
 
-	get_typical_interval(data);
-
-	/*
-	 * Performance multiplier defines a minimum predicted idle
-	 * duration / latency ratio. Adjust the latency limit if
-	 * necessary.
-	 */
-	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
-	if (latency_req > interactivity_req)
-		latency_req = interactivity_req;
+	expected_interval = get_typical_interval(data);
+	expected_interval = min(expected_interval, data->next_timer_us);
 
 	if (CPUIDLE_DRIVER_STATE_START > 0) {
 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
 		/*
 		 * We want to default to C1 (hlt), not to busy polling
-		 * unless the timer is happening really really soon.
+		 * unless the timer is happening really really soon, or
+		 * C1's exit latency exceeds the user configured limit.
 		 */
-		if (interactivity_req > 20 &&
+		if (expected_interval > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency &&
+		    latency_req > drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency &&
 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
-			dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
+		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)
 			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
 	} else {
 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
 	}
 
 	/*
+	 * Use the lowest expected idle interval to pick the idle state.
+	 */
+	data->predicted_us = min(data->predicted_us, expected_interval);
+
+	/*
+	 * Use the performance multiplier and the user-configurable
+	 * latency_req to determine the maximum exit latency.
+	 */
+	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
+	if (latency_req > interactivity_req)
+		latency_req = interactivity_req;
+
+	/*
 	 * Find the idle state with the lowest power while satisfying
 	 * our constraints.
 	 */


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-16 16:14                                                                   ` [PATCH] cpuidle: use high confidence factors only when considering polling Rik van Riel
@ 2016-03-18  0:45                                                                     ` Rafael J. Wysocki
  2016-03-18  6:32                                                                       ` Doug Smythies
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-18  0:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Doug Smythies, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

On Wednesday, March 16, 2016 12:14:00 PM Rik van Riel wrote:
> The menu governor uses five different factors to pick the
> idle state:
> - the user configured latency_req
> - the time until the next timer (next_timer_us)
> - the typical sleep interval, as measured recently
> - an estimate of sleep time by dividing next_timer_us by an observed factor
> - a load corrected version of the above, divided again by load
> 
> Only the first three items are known with enough confidence that
> we can use them to consider polling, instead of an actual CPU
> idle state, because the cost of being wrong about polling can be
> excessive power use.
> 
> The latter two are used in the menu governor's main selection
> loop, and can result in choosing a shallower idle state when
> the system is expected to be busy again soon.
> 
> This pushes a busy system in the "performance" direction of
> the performance<>power tradeoff, when choosing between idle
> states, but stays more strictly on the "power" side when
> deciding between polling and C1.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>

Applied, thanks!


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18  0:45                                                                     ` Rafael J. Wysocki
@ 2016-03-18  6:32                                                                       ` Doug Smythies
  2016-03-18 13:11                                                                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Doug Smythies @ 2016-03-18  6:32 UTC (permalink / raw)
  To: 'Rafael J. Wysocki', 'Rik van Riel'
  Cc: 'Rafael J. Wysocki', 'Viresh Kumar',
	'Srinivas Pandruvada', 'Chen, Yu C',
	linux-pm, 'Arto Jantunen', 'Len Brown'

Sorry for the delay in my reply / test. The patch e-mail went
to my junk folder for some reason.

On 2016.03.17 17:46 Rafael J. Wysocki wrote:
> On Wednesday, March 16, 2016 12:14:00 PM Rik van Riel wrote:
>> The menu governor uses five different factors to pick the
>> idle state:
>> - the user configured latency_req
>> - the time until the next timer (next_timer_us)
>> - the typical sleep interval, as measured recently
>> - an estimate of sleep time by dividing next_timer_us by an observed factor
>> - a load corrected version of the above, divided again by load
>> 
>> Only the first three items are known with enough confidence that
>> we can use them to consider polling, instead of an actual CPU
>> idle state, because the cost of being wrong about polling can be
>> excessive power use.
>> 
>> The latter two are used in the menu governor's main selection
>> loop, and can result in choosing a shallower idle state when
>> the system is expected to be busy again soon.
>> 
>> This pushes a busy system in the "performance" direction of
>> the performance<>power tradeoff, when choosing between idle
>> states, but stays more strictly on the "power" side when
>> deciding between polling and C1.
>> 
>> Signed-off-by: Rik van Riel <riel@redhat.com>

For my part of it, this patch does not seem to be O.K.
(reference rvr5 = this patch)

Aggregate idle for the 2000 second test. All in minutes.
(old tests re-stated)

State   k45rc7-rjw10    k45rc7-rjw10-reverted    k45rc7-rjw10-rvr5
0.00    18.07           0.92                     18.67
1.00    12.35           19.51                    12.82
2.00    3.96            4.28                     3.97
3.00    1.55            1.53                     1.58
4.00    138.96          141.99                   143.80

total   174.90          168.24                   180.84

Energy:
>> Kernel 4.5-rc7-rjw10: 61983 Joules
>> Kernel 4.5-rc7-rjw10-reverted: 48409 Joules (test 2 was 55040 Joules)
Kernel 4.5-rc7-rjw10-rvr5: 62243 Joules

I did acquire trace data with this test, but haven't post processed it yet.

... Doug



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18  6:32                                                                       ` Doug Smythies
@ 2016-03-18 13:11                                                                         ` Rafael J. Wysocki
  2016-03-18 18:32                                                                           ` Doug Smythies
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-18 13:11 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Rik van Riel, Rafael J. Wysocki, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

On Fri, Mar 18, 2016 at 7:32 AM, Doug Smythies <dsmythies@telus.net> wrote:
> Sorry for the delay in my reply / test. The patch e-mail went
> to my junk folder for some reason.
>
> On 2016.03.17 17:46 Rafael J. Wysocki wrote:
>> On Wednesday, March 16, 2016 12:14:00 PM Rik van Riel wrote:
>>> The menu governor uses five different factors to pick the
>>> idle state:
>>> - the user configured latency_req
>>> - the time until the next timer (next_timer_us)
>>> - the typical sleep interval, as measured recently
>>> - an estimate of sleep time by dividing next_timer_us by an observed factor
>>> - a load corrected version of the above, divided again by load
>>>
>>> Only the first three items are known with enough confidence that
>>> we can use them to consider polling, instead of an actual CPU
>>> idle state, because the cost of being wrong about polling can be
>>> excessive power use.
>>>
>>> The latter two are used in the menu governor's main selection
>>> loop, and can result in choosing a shallower idle state when
>>> the system is expected to be busy again soon.
>>>
>>> This pushes a busy system in the "performance" direction of
>>> the performance<>power tradeoff, when choosing between idle
>>> states, but stays more strictly on the "power" side when
>>> deciding between polling and C1.
>>>
>>> Signed-off-by: Rik van Riel <riel@redhat.com>
>
> For my part of it, this patch seems to be not O.K.
> (reference rvr5 = this patch)
>
> Aggregate idle for the 2000 second test. All in minutes.
> (old tests re-stated)
>
> State   k45rc7-rjw10    k45rc7-rjw10-reverted    k45rc7-rjw10-rvr5
> 0.00    18.07           0.92                     18.67
> 1.00    12.35           19.51                    12.82
> 2.00    3.96            4.28                     3.97
> 3.00    1.55            1.53                     1.58
> 4.00    138.96          141.99                   143.80
>
> total   174.90          168.24                   180.84
>
> Energy:
>>> Kernel 4.5-rc7-rjw10: 61983 Joules
>>> Kernel 4.5-rc7-rjw10-reverted: 48409 Joules (test 2 was 55040 Joules)
> Kernel 4.5-rc7-rjw10-rvr5: 62243 Joules
>
> I did acquire trace data with this test, but haven't post processed it yet.

I'm wondering what happens if you replace the expected_interval in the
"expected_interval >
drv->states[CPUIDLE_DRIVER_STATE_START].target_residency" test with
data->next_timer_us (with the Rik's patch applied, of course).  Can
you please try doing that?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 13:11                                                                         ` Rafael J. Wysocki
@ 2016-03-18 18:32                                                                           ` Doug Smythies
  2016-03-18 19:29                                                                             ` Rik van Riel
  2016-03-18 21:35                                                                             ` Rik van Riel
  0 siblings, 2 replies; 59+ messages in thread
From: Doug Smythies @ 2016-03-18 18:32 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Rafael J. Wysocki', 'Rik van Riel',
	'Viresh Kumar', 'Srinivas Pandruvada',
	'Chen, Yu C', linux-pm, 'Arto Jantunen',
	'Len Brown'

On 2016.03.18 06:12 Rafael J. Wysocki wrote:
> On Fri, Mar 18, 2016 at 7:32 AM, Doug Smythies <dsmythies@telus.net> wrote:
>
>> For my part of it, this patch seems to be not O.K.
>> (reference rvr5 = this patch)
>>
>> Aggregate idle for the 2000 second test. All in minutes.
>> (old tests re-stated)
>>
>> State   k45rc7-rjw10    k45rc7-rjw10-reverted    k45rc7-rjw10-rvr5
>> 0.00    18.07           0.92                     18.67
>> 1.00    12.35           19.51                    12.82
>> 2.00    3.96            4.28                     3.97
>> 3.00    1.55            1.53                     1.58
>> 4.00    138.96          141.99                   143.80
>>
>> total   174.90          168.24                   180.84
>>
>> Energy:
>>>> Kernel 4.5-rc7-rjw10: 61983 Joules
>>>> Kernel 4.5-rc7-rjw10-reverted: 48409 Joules (test 2 was 55040 Joules)
>> Kernel 4.5-rc7-rjw10-rvr5: 62243 Joules
>>
>> I did acquire trace data with this test, but haven't post processed it yet.
>
> I'm wondering what happens if you replace the expected_interval in the
> "expected_interval >
> drv->states[CPUIDLE_DRIVER_STATE_START].target_residency" test with
> data->next_timer_us (with the Rik's patch applied, of course).  Can
> you please try doing that?

O.K., my reference rvr6 is the above modification to rvr5.
It works as well as "reverted".

State	k45rc7-rjw10-rvr6 (mins)
0.00	0.87
1.00	24.20
2.00	4.05
3.00	1.72
4.00	147.50

total	178.34

Energy:
Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules

Trace data (very crude summary):
Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load (idle state 0)
Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less, CPU load (not all idle state 0)

... Doug



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 18:32                                                                           ` Doug Smythies
@ 2016-03-18 19:29                                                                             ` Rik van Riel
  2016-03-18 20:59                                                                               ` Doug Smythies
                                                                                                 ` (2 more replies)
  2016-03-18 21:35                                                                             ` Rik van Riel
  1 sibling, 3 replies; 59+ messages in thread
From: Rik van Riel @ 2016-03-18 19:29 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael J. Wysocki', 'Rafael J. Wysocki',
	'Viresh Kumar', 'Srinivas Pandruvada',
	'Chen, Yu C', linux-pm, 'Arto Jantunen',
	'Len Brown'

On Fri, 18 Mar 2016 11:32:28 -0700
"Doug Smythies" <dsmythies@telus.net> wrote:
> On 2016.03.18 06:12 Rafael J. Wysocki wrote:

> > I'm wondering what happens if you replace the expected_interval in the
> > "expected_interval >
> > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency" test with
> > data->next_timer_us (with the Rik's patch applied, of course).  Can
> > you please try doing that?  
> 
> O.K. my reference: rvr6 is the above modification to rvr5
> It works as well as "reverted"/
> 
> State	k45rc7-rjw10-rvr6 (mins)
> 0.00	0.87
> 1.00	24.20
> 2.00	4.05
> 3.00	1.72
> 4.00	147.50
> 
> total	178.34
> 
> Energy:
> Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
> 
> Trace data (very crude summary):
> Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load (idle state 0)
> Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less, CPU load (not all idle state 0)

What does "long duration" mean? 
Dozens of microseconds?
Hundreds of microseconds?
Milliseconds?

Either way, it appears there is something wrong with the
code in get_typical_interval.  One of the problems is
that calculating in microseconds, when working with a
threshold of 1-2 microseconds is not going to work well,
and secondly the code declares success the moment the
standard deviation is below 20 microseconds, which is
also not the best idea when dealing with 1-2 microsecond
thresholds :)

Does the below patch help?

If it does, we will probably want to change the values in
the driver tables to nanoseconds too, so we can get rid of
the multiplications :)
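
To make the truncation point concrete, here is a tiny standalone
illustration. The nanosecond values are made up, not from a trace; this
only models what happens when each measured residency is stored in whole
microseconds before averaging:

```c
#include <assert.h>
#include <stddef.h>

/* Average after a per-sample divide to microseconds: the divide
 * drops each sample's sub-microsecond remainder, which is exactly
 * the detail a 1-2 us C1 cutoff depends on. */
static long avg_truncated_us(const long *ns, size_t n)
{
	long sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += ns[i] / 1000;	/* remainder discarded here */
	return sum / (long)n;
}

/* Average kept in nanoseconds the whole way. */
static long avg_ns(const long *ns, size_t n)
{
	long sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += ns[i];
	return sum / (long)n;
}
```

With samples of 1.5-1.9 us, the microsecond average reads as 1 us while
the nanosecond average is 1.7 us: on opposite sides of a realistic C1
cutoff.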

---8<---

Subject: cpuidle: track time in nsecs not usecs

The cpuidle code currently tracks time in microseconds, which
has a few issues. For one, the kernel natively tracks time in
nanoseconds, and we do a divide by 1000 when tracking how long
the CPU spent in an idle state.

Secondly, get_typical_interval in the menu governor cannot do
very useful math when working with small numbers, while the
HLT cutoff is only 1 or 2 microseconds on many CPUs.

Converting to nanoseconds avoids the expensive divide, and
also allows get_typical_interval to do calculations that
may make more sense for smaller intervals.

While we're there, also remove the code that allows
get_typical_interval to return a value as soon as the
standard deviation is under 20 microseconds.  That
threshold makes little sense when C1 and C2 both
have latencies much lower than that.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 drivers/cpuidle/cpuidle.c          | 10 ++---
 drivers/cpuidle/governors/ladder.c |  2 +-
 drivers/cpuidle/governors/menu.c   | 90 +++++++++++++++++---------------------
 drivers/cpuidle/sysfs.c            |  8 +++-
 include/linux/cpuidle.h            |  4 +-
 5 files changed, 56 insertions(+), 58 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index f996efc56605..8a0673880fbe 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -217,18 +217,18 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 	if (!cpuidle_state_is_coupled(drv, entered_state))
 		local_irq_enable();
 
-	diff = ktime_to_us(ktime_sub(time_end, time_start));
-	if (diff > INT_MAX)
-		diff = INT_MAX;
+	diff = ktime_to_ns(ktime_sub(time_end, time_start));
+	if (diff > LONG_MAX)
+		diff = LONG_MAX;
 
-	dev->last_residency = (int) diff;
+	dev->last_residency = diff;
 
 	if (entered_state >= 0) {
 		/* Update cpuidle counters */
 		/* This can be moved to within driver enter routine
 		 * but that results in multiple copies of same code.
 		 */
-		dev->states_usage[entered_state].time += dev->last_residency;
+		dev->states_usage[entered_state].time += diff;
 		dev->states_usage[entered_state].usage++;
 	} else {
 		dev->last_residency = 0;
diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
index 63bd5a403e22..6b480a3a3a2e 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -80,7 +80,7 @@ static int ladder_select_state(struct cpuidle_driver *drv,
 
 	last_state = &ldev->states[last_idx];
 
-	last_residency = cpuidle_get_last_residency(dev) - drv->states[last_idx].exit_latency;
+	last_residency = ktime_to_us(cpuidle_get_last_residency(dev)) - drv->states[last_idx].exit_latency;
 
 	/* consider promotion */
 	if (last_idx < drv->state_count - 1 &&
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index fba867d917f7..69adb6ab7b8c 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -33,9 +33,9 @@
 #define BUCKETS 12
 #define INTERVAL_SHIFT 3
 #define INTERVALS (1UL << INTERVAL_SHIFT)
-#define RESOLUTION 1024
+#define RESOLUTION NSEC_PER_USEC
 #define DECAY 8
-#define MAX_INTERESTING 50000
+#define MAX_INTERESTING 4000000
 
 
 /*
@@ -122,11 +122,11 @@ struct menu_device {
 	int		last_state_idx;
 	int             needs_update;
 
-	unsigned int	next_timer_us;
-	unsigned int	predicted_us;
+	u64		next_timer_ns;
+	u64		predicted_ns;
 	unsigned int	bucket;
 	unsigned int	correction_factor[BUCKETS];
-	unsigned int	intervals[INTERVALS];
+	u64		intervals[INTERVALS];
 	int		interval_ptr;
 };
 
@@ -152,15 +152,15 @@ static inline int which_bucket(unsigned int duration, unsigned long nr_iowaiters
 	if (nr_iowaiters)
 		bucket = BUCKETS/2;
 
-	if (duration < 10)
+	if (duration < 10000)
 		return bucket;
-	if (duration < 100)
+	if (duration < 100000)
 		return bucket + 1;
-	if (duration < 1000)
+	if (duration < 1000000)
 		return bucket + 2;
-	if (duration < 10000)
+	if (duration < 10000000)
 		return bucket + 3;
-	if (duration < 100000)
+	if (duration < 100000000)
 		return bucket + 4;
 	return bucket + 5;
 }
@@ -242,21 +242,12 @@ static unsigned int get_typical_interval(struct menu_device *data)
 	 * The typical interval is obtained when standard deviation is small
 	 * or standard deviation is small compared to the average interval.
 	 *
-	 * int_sqrt() formal parameter type is unsigned long. When the
-	 * greatest difference to an outlier exceeds ~65 ms * sqrt(divisor)
-	 * the resulting squared standard deviation exceeds the input domain
-	 * of int_sqrt on platforms where unsigned long is 32 bits in size.
-	 * In such case reject the candidate average.
-	 *
-	 * Use this result only if there is no timer to wake us up sooner.
+	 * Instead of taking the square root of the variance, compare the
+	 * square of the average against it. Be careful to avoid overflow.
 	 */
-	if (likely(stddev <= ULONG_MAX)) {
-		stddev = int_sqrt(stddev);
-		if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
-							|| stddev <= 20) {
-			return avg;
-		}
-	}
+	if (avg < INT_MAX && (avg * avg) > stddev * 9 &&
+					divisor * 4 >= INTERVALS * 3)
+		return avg;
 
 	/*
 	 * If we have outliers to the upside in our distribution, discard
@@ -282,10 +273,10 @@ static unsigned int get_typical_interval(struct menu_device *data)
 static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 {
 	struct menu_device *data = this_cpu_ptr(&menu_devices);
-	int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
+	u64 latency_req = NSEC_PER_USEC * pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
 	int i;
-	unsigned int interactivity_req;
-	unsigned int expected_interval;
+	u64 interactivity_req;
+	u64 expected_interval;
 	unsigned long nr_iowaiters, cpu_load;
 
 	if (data->needs_update) {
@@ -298,22 +289,21 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 		return 0;
 
 	/* determine the expected residency time, round up */
-	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
+	data->next_timer_ns = ktime_to_ns(tick_nohz_get_sleep_length());
 
 	get_iowait_load(&nr_iowaiters, &cpu_load);
-	data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
+	data->bucket = which_bucket(data->next_timer_ns, nr_iowaiters);
 
 	/*
 	 * Force the result of multiplication to be 64 bits even if both
 	 * operands are 32 bits.
-	 * Make sure to round up for half microseconds.
+	 * Make sure to round up for half nanoseconds.
 	 */
-	data->predicted_us = DIV_ROUND_CLOSEST_ULL((uint64_t)data->next_timer_us *
+	data->predicted_ns = DIV_ROUND_CLOSEST_ULL((uint64_t)data->next_timer_ns *
 					 data->correction_factor[data->bucket],
 					 RESOLUTION * DECAY);
 
 	expected_interval = get_typical_interval(data);
-	expected_interval = min(expected_interval, data->next_timer_us);
+	expected_interval = min(expected_interval, data->next_timer_ns);
 
 	if (CPUIDLE_DRIVER_STATE_START > 0) {
 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
@@ -322,8 +312,8 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 		 * unless the timer is happening really really soon, or
 		 * C1's exit latency exceeds the user configured limit.
 		 */
-		if (expected_interval > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency &&
-		    latency_req > drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency &&
+		if (expected_interval > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency * NSEC_PER_USEC &&
+		    latency_req > drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency * NSEC_PER_USEC &&
 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
 		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)
 			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
@@ -334,13 +324,15 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	/*
 	 * Use the lowest expected idle interval to pick the idle state.
 	 */
-	data->predicted_us = min(data->predicted_us, expected_interval);
+	data->predicted_ns = min(data->predicted_ns, expected_interval);
 
 	/*
 	 * Use the performance multiplier and the user-configurable
 	 * latency_req to determine the maximum exit latency.
 	 */
-	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
+	get_iowait_load(&nr_iowaiters, &cpu_load);
+	interactivity_req = data->predicted_ns;
+	do_div(interactivity_req, performance_multiplier(nr_iowaiters, cpu_load));
 	if (latency_req > interactivity_req)
 		latency_req = interactivity_req;
 
@@ -354,9 +346,9 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 
 		if (s->disabled || su->disable)
 			continue;
-		if (s->target_residency > data->predicted_us)
+		if (s->target_residency * NSEC_PER_USEC > data->predicted_ns)
 			continue;
-		if (s->exit_latency > latency_req)
+		if (s->exit_latency * NSEC_PER_USEC > latency_req)
 			continue;
 
 		data->last_state_idx = i;
@@ -391,8 +383,8 @@ static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	struct menu_device *data = this_cpu_ptr(&menu_devices);
 	int last_idx = data->last_state_idx;
 	struct cpuidle_state *target = &drv->states[last_idx];
-	unsigned int measured_us;
 	unsigned int new_factor;
+	u64 measured_ns;
 
 	/*
 	 * Try to figure out how much time passed between entry to low
@@ -410,24 +402,24 @@ static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	 */
 
 	/* measured value */
-	measured_us = cpuidle_get_last_residency(dev);
+	measured_ns = cpuidle_get_last_residency(dev);
 
 	/* Deduct exit latency */
-	if (measured_us > 2 * target->exit_latency)
-		measured_us -= target->exit_latency;
+	if (measured_ns > 2 * NSEC_PER_USEC * target->exit_latency)
+		measured_ns -= NSEC_PER_USEC * target->exit_latency;
 	else
-		measured_us /= 2;
+		measured_ns /= 2;
 
 	/* Make sure our coefficients do not exceed unity */
-	if (measured_us > data->next_timer_us)
-		measured_us = data->next_timer_us;
+	if (measured_ns > data->next_timer_ns)
+		measured_ns = data->next_timer_ns;
 
 	/* Update our correction ratio */
 	new_factor = data->correction_factor[data->bucket];
 	new_factor -= new_factor / DECAY;
 
-	if (data->next_timer_us > 0 && measured_us < MAX_INTERESTING)
-		new_factor += RESOLUTION * measured_us / data->next_timer_us;
+	if (data->next_timer_ns > 0 && measured_ns < MAX_INTERESTING)
+		new_factor += RESOLUTION * measured_ns / data->next_timer_ns;
 	else
 		/*
 		 * we were idle so long that we count it as a perfect
@@ -447,7 +439,7 @@ static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	data->correction_factor[data->bucket] = new_factor;
 
 	/* update the repeating-pattern data */
-	data->intervals[data->interval_ptr++] = measured_us;
+	data->intervals[data->interval_ptr++] = measured_ns;
 	if (data->interval_ptr >= INTERVALS)
 		data->interval_ptr = 0;
 }
diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c
index 832a2c3f01ff..e2fa2cf4f8a5 100644
--- a/drivers/cpuidle/sysfs.c
+++ b/drivers/cpuidle/sysfs.c
@@ -292,11 +292,17 @@ static ssize_t show_state_##_name(struct cpuidle_state *state, \
 	return sprintf(buf, "%s\n", state->_name);\
 }
 
+static ssize_t show_state_time(struct cpuidle_state *state,
+				  struct cpuidle_state_usage *state_usage,
+				  char *buf)
+{
+	return sprintf(buf, "%llu\n", NSEC_PER_USEC * state_usage->time);
+}
+
 define_show_state_function(exit_latency)
 define_show_state_function(target_residency)
 define_show_state_function(power_usage)
 define_show_state_ull_function(usage)
-define_show_state_ull_function(time)
 define_show_state_str_function(name)
 define_show_state_str_function(desc)
 define_show_state_ull_function(disable)
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 786ad32631a6..e6bab06d5502 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -76,7 +76,7 @@ struct cpuidle_device {
 	unsigned int		enabled:1;
 	unsigned int		cpu;
 
-	int			last_residency;
+	u64			last_residency;
 	struct cpuidle_state_usage	states_usage[CPUIDLE_STATE_MAX];
 	struct cpuidle_state_kobj *kobjs[CPUIDLE_STATE_MAX];
 	struct cpuidle_driver_kobj *kobj_driver;
@@ -96,7 +96,7 @@ DECLARE_PER_CPU(struct cpuidle_device, cpuidle_dev);
  * cpuidle_get_last_residency - retrieves the last state's residency time
  * @dev: the target CPU
  */
-static inline int cpuidle_get_last_residency(struct cpuidle_device *dev)
+static inline u64 cpuidle_get_last_residency(struct cpuidle_device *dev)
 {
 	return dev->last_residency;
 }

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* RE: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 19:29                                                                             ` Rik van Riel
@ 2016-03-18 20:59                                                                               ` Doug Smythies
  2016-03-18 21:24                                                                               ` Rafael J. Wysocki
  2016-03-18 23:40                                                                               ` Rafael J. Wysocki
  2 siblings, 0 replies; 59+ messages in thread
From: Doug Smythies @ 2016-03-18 20:59 UTC (permalink / raw)
  To: 'Rik van Riel'
  Cc: 'Rafael J. Wysocki', 'Rafael J. Wysocki',
	'Viresh Kumar', 'Srinivas Pandruvada',
	'Chen, Yu C', linux-pm, 'Arto Jantunen',
	'Len Brown'

On 2016.03.18 12:30 Rik van Riel wrote:
> On Fri, 18 Mar 2016 11:32:28 -0700 Doug Smythies wrote:
>> On 2016.03.18 06:12 Rafael J. Wysocki wrote:

>>> I'm wondering what happens if you replace the expected_interval in the
>>> "expected_interval >
>>> drv->states[CPUIDLE_DRIVER_STATE_START].target_residency" test with
>>> data->next_timer_us (with the Rik's patch applied, of course).  Can
>>> you please try doing that?  
>> 
>> O.K. my reference: rvr6 is the above modification to rvr5
It works as well as "reverted".
>> 
>> State	k45rc7-rjw10-rvr6 (mins)
>> 0.00	0.87
>> 1.00	24.20
>> 2.00	4.05
>> 3.00	1.72
>> 4.00	147.50
>> 
>> total	178.34
>> 
>> Energy:
>> Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
>> 
>> Trace data (very crude summary):
>> Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load (idle state 0)
>> Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less, CPU load (not all idle state 0)
>
> What does "long duration" mean? 
> Dozens of microseconds?
> Hundreds of microseconds?
> Milliseconds?

On average, 100s of milliseconds, and as much as 4 seconds.

Specifically, for the Kernel 4.5-rc7-rjw10-rvr5 case, of 3049:
The average load was 97.2% and the average "long" duration was 295.2 ms.
Example 1: CPU 5 load 99.96% duration 1.96 seconds.
Example 2: CPU 5 load 99.74% duration 2.68 seconds.
Example 3: CPU 7 load 97.86% duration 2.30 seconds.

So, to repeat what I said the other day, but in another way:
The estimate can be correct 99.9% (or even more) of the time,
but when it isn't right, and the CPU gets left in idle state 0,
sometimes it can get left there for a very very long time.

> Either way, it appears there is something wrong with the
> code in get_typical_interval.  One of the problems is
> that calculating in microseconds, when working with a
> threshold of 1-2 microseconds is not going to work well,
> and secondly the code declares success the moment the
> standard deviation is below 20 microseconds, which is
> also not the best idea when dealing with 1-2 microsecond
> thresholds :)
>
> Does the below patch help?

I'll report back later on that part. The test computer is busy
with something else at the moment.

... Doug



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 19:29                                                                             ` Rik van Riel
  2016-03-18 20:59                                                                               ` Doug Smythies
@ 2016-03-18 21:24                                                                               ` Rafael J. Wysocki
  2016-03-18 21:26                                                                                 ` Rik van Riel
  2016-03-18 23:40                                                                               ` Rafael J. Wysocki
  2 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-18 21:24 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Doug Smythies, Rafael J. Wysocki, Rafael J. Wysocki,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Fri, Mar 18, 2016 at 8:29 PM, Rik van Riel <riel@redhat.com> wrote:
> On Fri, 18 Mar 2016 11:32:28 -0700
> "Doug Smythies" <dsmythies@telus.net> wrote:
>> On 2016.03.18 06:12 Rafael J. Wysocki wrote:
>
>> > I'm wondering what happens if you replace the expected_interval in the
>> > "expected_interval >
>> > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency" test with
>> > data->next_timer_us (with the Rik's patch applied, of course).  Can
>> > you please try doing that?
>>
>> O.K. my reference: rvr6 is the above modification to rvr5
>> It works as well as "reverted".
>>
>> State k45rc7-rjw10-rvr6 (mins)
>> 0.00  0.87
>> 1.00  24.20
>> 2.00  4.05
>> 3.00  1.72
>> 4.00  147.50
>>
>> total 178.34
>>
>> Energy:
>> Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
>>
>> Trace data (very crude summary):
>> Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load (idle state 0)
>> Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less, CPU load (not all idle state 0)
>
> What does "long duration" mean?
> Dozens of microseconds?
> Hundreds of microseconds?
> Milliseconds?
>
> Either way, it appears there is something wrong with the
> code in get_typical_interval.  One of the problems is
> that calculating in microseconds, when working with a
> threshold of 1-2 microseconds is not going to work well,
> and secondly the code declares success the moment the
> standard deviation is below 20 microseconds, which is
> also not the best idea when dealing with 1-2 microsecond
> thresholds :)

Well, that might be the reason why 20 was used in the original
data->next_timer_us test ...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 21:24                                                                               ` Rafael J. Wysocki
@ 2016-03-18 21:26                                                                                 ` Rik van Riel
  0 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2016-03-18 21:26 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, Rafael J. Wysocki, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 2064 bytes --]

On Fri, 2016-03-18 at 22:24 +0100, Rafael J. Wysocki wrote:
> On Fri, Mar 18, 2016 at 8:29 PM, Rik van Riel <riel@redhat.com>
> wrote:
> > 
> > On Fri, 18 Mar 2016 11:32:28 -0700
> > "Doug Smythies" <dsmythies@telus.net> wrote:
> > > 
> > > On 2016.03.18 06:12 Rafael J. Wysocki wrote:
> > > 
> > > > 
> > > > I'm wondering what happens if you replace the expected_interval
> > > > in the
> > > > "expected_interval >
> > > > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency" test
> > > > with
> > > > data->next_timer_us (with the Rik's patch applied, of
> > > > course).  Can
> > > > you please try doing that?
> > > O.K. my reference: rvr6 is the above modification to rvr5
> > > It works as well as "reverted".
> > > 
> > > State k45rc7-rjw10-rvr6 (mins)
> > > 0.00  0.87
> > > 1.00  24.20
> > > 2.00  4.05
> > > 3.00  1.72
> > > 4.00  147.50
> > > 
> > > total 178.34
> > > 
> > > Energy:
> > > Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
> > > 
> > > Trace data (very crude summary):
> > > Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load
> > > (idle state 0)
> > > Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less,
> > > CPU load (not all idle state 0)
> > What does "long duration" mean?
> > Dozens of microseconds?
> > Hundreds of microseconds?
> > Milliseconds?
> > 
> > Either way, it appears there is something wrong with the
> > code in get_typical_interval.  One of the problems is
> > that calculating in microseconds, when working with a
> > threshold of 1-2 microseconds is not going to work well,
> > and secondly the code declares success the moment the
> > standard deviation is below 20 microseconds, which is
> > also not the best idea when dealing with 1-2 microsecond
> > thresholds :)
> Well, that might be the reason why 20 was used in the original
> data->next_timer_us test ...

Good point, and it looks like Doug's problem is something
else.  Let me send a test patch for Doug in another email...

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 18:32                                                                           ` Doug Smythies
  2016-03-18 19:29                                                                             ` Rik van Riel
@ 2016-03-18 21:35                                                                             ` Rik van Riel
  2016-03-18 21:48                                                                               ` Rafael J. Wysocki
  2016-03-18 22:56                                                                               ` Rafael J. Wysocki
  1 sibling, 2 replies; 59+ messages in thread
From: Rik van Riel @ 2016-03-18 21:35 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael J. Wysocki', 'Rafael J. Wysocki',
	'Viresh Kumar', 'Srinivas Pandruvada',
	'Chen, Yu C', linux-pm, 'Arto Jantunen',
	'Len Brown'

On Fri, 18 Mar 2016 11:32:28 -0700
"Doug Smythies" <dsmythies@telus.net> wrote:

> Energy:
> Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
> 
> Trace data (very crude summary):
> Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load (idle state 0)
> Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less, CPU load (not all idle state 0)

This is a bit of a big hammer, but since the polling loop already uses
full CPU power anyway, this could be harmless, and solve your problem.

---8<---

Subject: cpuidle: break out of idle polling loop after HLT threshold

Staying in the poll loop can be beneficial for workloads that
wake back up very, very quickly, but staying in the poll loop
for too long can waste a lot of power.

Break out of the idle polling loop if the system has spent more
time in the loop than the threshold for entering the next power
saving mode.

The poll loop uses full CPU power already, so this will not
increase the power use of the poll loop.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 drivers/cpuidle/driver.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
index 389ade4572be..a5dfeafd243f 100644
--- a/drivers/cpuidle/driver.c
+++ b/drivers/cpuidle/driver.c
@@ -181,10 +181,29 @@ static void __cpuidle_driver_init(struct cpuidle_driver *drv)
 static int poll_idle(struct cpuidle_device *dev,
 		struct cpuidle_driver *drv, int index)
 {
+	/*
+	 * Polling is state 0; break out of the polling loop after the
+	 * threshold for the next power state has passed. Doing this several
+	 * times in a row should cause the menu governor to immediately
+	 * select a deeper power state.
+	 */
+	u64 limit = drv->states[1].target_residency * NSEC_PER_USEC;
+	ktime_t start = ktime_get();
+	int i = 0;
+
 	local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		while (!need_resched())
+		while (!need_resched()) {
+			ktime_t now;
 			cpu_relax();
+
+			/* Occasionally check for a timeout. */
+			if (!(++i % 1000)) {
+				now = ktime_get();
+				if (ktime_to_ns(ktime_sub(now, start)) > limit)
+					break;
+			}
+		}
 	}
 	current_clr_polling();
 

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 21:35                                                                             ` Rik van Riel
@ 2016-03-18 21:48                                                                               ` Rafael J. Wysocki
  2016-03-18 21:52                                                                                 ` Rik van Riel
  2016-03-18 21:56                                                                                 ` Rafael J. Wysocki
  2016-03-18 22:56                                                                               ` Rafael J. Wysocki
  1 sibling, 2 replies; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-18 21:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Doug Smythies, Rafael J. Wysocki, Rafael J. Wysocki,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com> wrote:
> On Fri, 18 Mar 2016 11:32:28 -0700
> "Doug Smythies" <dsmythies@telus.net> wrote:
>
>> Energy:
>> Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
>>
>> Trace data (very crude summary):
>> Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load (idle state 0)
>> Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less, CPU load (not all idle state 0)
>
> This is a bit of a big hammer, but since the polling loop already uses
> full CPU power anyway, this could be harmless, and solve your problem.

Yes, it would be good to test this.

No, I don't like it.

If we are to break out of the polling loop after a specific time, we
shouldn't be using polling at all in the first place.

So I very much prefer to compare data->next_timer_us with the target
residency of C1 instead of doing this.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 21:48                                                                               ` Rafael J. Wysocki
@ 2016-03-18 21:52                                                                                 ` Rik van Riel
  2016-03-18 22:09                                                                                   ` Rafael J. Wysocki
  2016-03-18 21:56                                                                                 ` Rafael J. Wysocki
  1 sibling, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-18 21:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, Rafael J. Wysocki, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 1332 bytes --]

On Fri, 2016-03-18 at 22:48 +0100, Rafael J. Wysocki wrote:
> On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com>
> wrote:
> > 
> > On Fri, 18 Mar 2016 11:32:28 -0700
> > "Doug Smythies" <dsmythies@telus.net> wrote:
> > 
> > > 
> > > Energy:
> > > Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
> > > 
> > > Trace data (very crude summary):
> > > Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load
> > > (idle state 0)
> > > Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less,
> > > CPU load (not all idle state 0)
> > This is a bit of a big hammer, but since the polling loop already
> > uses
> > full CPU power anyway, this could be harmless, and solve your
> > problem.
> Yes, it would be good to test this.
> 
> No, I don't like it.
> 
> If we are to break out of the polling loop after a specific time, we
> shouldn't be using polling at all in the first place.
> 
> So I very much prefer to compare data->next_timer_us with the target
> residency of C1 instead of doing this.

The problem is that data->next_timer_us does not tell us
when the next network packet is about to arrive, though.

This could cause issues when dealing with a fast stream
of network traffic, which get_typical_interval() would
be able to deal with.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 21:48                                                                               ` Rafael J. Wysocki
  2016-03-18 21:52                                                                                 ` Rik van Riel
@ 2016-03-18 21:56                                                                                 ` Rafael J. Wysocki
  2016-03-18 22:38                                                                                   ` Rafael J. Wysocki
  1 sibling, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-18 21:56 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rik van Riel, Doug Smythies, Viresh Kumar, Srinivas Pandruvada,
	Chen, Yu C, linux-pm, Arto Jantunen, Len Brown

On Friday, March 18, 2016 10:48:45 PM Rafael J. Wysocki wrote:
> On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com> wrote:
> > On Fri, 18 Mar 2016 11:32:28 -0700
> > "Doug Smythies" <dsmythies@telus.net> wrote:
> >
> >> Energy:
> >> Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
> >>
> >> Trace data (very crude summary):
> >> Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load (idle state 0)
> >> Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less, CPU load (not all idle state 0)
> >
> > This is a bit of a big hammer, but since the polling loop already uses
> > full CPU power anyway, this could be harmless, and solve your problem.
> 
> Yes, it would be good to test this.
> 
> No, I don't like it.
> 
> If we are to break out of the polling loop after a specific time, we
> shouldn't be using polling at all in the first place.
> 
> So I very much prefer to compare data->next_timer_us with the target
> residency of C1 instead of doing this.

IOW, something like the below (on top of the $subject patch).

---
 drivers/cpuidle/governors/menu.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -321,7 +321,7 @@ static int menu_select(struct cpuidle_dr
 		 * unless the timer is happening really really soon, or
 		 * C1's exit latency exceeds the user configured limit.
 		 */
-		if (expected_interval > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency &&
+		if (data->next_timer_us > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency &&
 		    latency_req > drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency &&
 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
 		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 21:52                                                                                 ` Rik van Riel
@ 2016-03-18 22:09                                                                                   ` Rafael J. Wysocki
  2016-03-18 22:28                                                                                     ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-18 22:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Doug Smythies, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 1777 bytes --]

On Friday, March 18, 2016 05:52:49 PM Rik van Riel wrote:
> On Fri, 2016-03-18 at 22:48 +0100, Rafael J. Wysocki wrote:
> > On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com>
> > wrote:
> > > 
> > > On Fri, 18 Mar 2016 11:32:28 -0700
> > > "Doug Smythies" <dsmythies@telus.net> wrote:
> > > 
> > > > 
> > > > Energy:
> > > > Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
> > > > 
> > > > Trace data (very crude summary):
> > > > Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load
> > > > (idle state 0)
> > > > Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less,
> > > > CPU load (not all idle state 0)
> > > This is a bit of a big hammer, but since the polling loop already
> > > uses
> > > full CPU power anyway, this could be harmless, and solve your
> > > problem.
> > Yes, it would be good to test this.
> > 
> > No, I don't like it.
> > 
> > If we are to break out of the polling loop after a specific time, we
> > shouldn't be using polling at all in the first place.
> > 
> > So I very much prefer to compare data->next_timer_us with the target
> > residency of C1 instead of doing this.
> 
> The problem is that data->next_timer_us does not tell us
> when the next network packet is about to arrive, though.
> 
> This could cause issues when dealing with a fast stream
> of network traffic, which get_typical_interval() would
> be able to deal with.

If it predicted things correctly for small intervals, that is.

It doesn't seem to be able to do that, however.

You seem to want to trade energy consumption for performance with a very narrow
use case in mind and that by changing behavior that's been there pretty much
forever, so by now it really has become the expected one.

Needless to say I'm not a fan of changes like that.

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 22:09                                                                                   ` Rafael J. Wysocki
@ 2016-03-18 22:28                                                                                     ` Rik van Riel
  2016-03-18 23:29                                                                                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-18 22:28 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Doug Smythies, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 2628 bytes --]

On Fri, 2016-03-18 at 23:09 +0100, Rafael J. Wysocki wrote:
> On Friday, March 18, 2016 05:52:49 PM Rik van Riel wrote:
> > 
> > On Fri, 2016-03-18 at 22:48 +0100, Rafael J. Wysocki wrote:
> > > 
> > > On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com>
> > > wrote:
> > > > 
> > > > 
> > > > On Fri, 18 Mar 2016 11:32:28 -0700
> > > > "Doug Smythies" <dsmythies@telus.net> wrote:
> > > > 
> > > > > 
> > > > > 
> > > > > Energy:
> > > > > Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
> > > > > 
> > > > > Trace data (very crude summary):
> > > > > Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU
> > > > > load
> > > > > (idle state 0)
> > > > > Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but
> > > > > less,
> > > > > CPU load (not all idle state 0)
> > > > This is a bit of a big hammer, but since the polling loop
> > > > already
> > > > uses
> > > > full CPU power anyway, this could be harmless, and solve your
> > > > problem.
> > > Yes, it would be good to test this.
> > > 
> > > No, I don't like it.
> > > 
> > > If we are to break out of the polling loop after a specific time,
> > > we
> > > shouldn't be using polling at all in the first place.
> > > 
> > > So I very much prefer to compare data->next_timer_us with the
> > > target
> > > residency of C1 instead of doing this.
> > The problem is that data->next_timer_us does not tell us
> > when the next network packet is about to arrive, though.
> > 
> > This could cause issues when dealing with a fast stream
> > of network traffic, which get_typical_interval() would
> > be able to deal with.
> If it predicted things correctly for small intervals, that is.
> 
> It doesn't seem to be able to do that, however.

It appears to get it right the vast majority of the
time; it is just that when it gets it wrong in a tiny
minority of the cases, the penalty for polling is
really high.

> You seem to want to trade energy consumption for performance with a
> very narrow
> use case in mind and that by changing behavior that's been there
> pretty much
> forever, so by now it really has become the expected one.
> 
> Needless to say I'm not a fan of changes like that.

I am not sure it is a narrow use case, and I suspect
cases with over half a million wakeups a second are
not going to be all that unusual.

The same thing applies not just to network traffic,
but also to tasks ping-ponging info between CPUs,
e.g. because some tasks are pinned, or because we
are dealing with a controller thread and a larger
number of workers.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 21:56                                                                                 ` Rafael J. Wysocki
@ 2016-03-18 22:38                                                                                   ` Rafael J. Wysocki
  0 siblings, 0 replies; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-18 22:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Doug Smythies, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

On Friday, March 18, 2016 10:56:38 PM Rafael J. Wysocki wrote:
> On Friday, March 18, 2016 10:48:45 PM Rafael J. Wysocki wrote:
> > On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com> wrote:
> > > On Fri, 18 Mar 2016 11:32:28 -0700
> > > "Doug Smythies" <dsmythies@telus.net> wrote:
> > >
> > >> Energy:
> > >> Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
> > >>
> > >> Trace data (very crude summary):
> > >> Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load (idle state 0)
> > >> Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less, CPU load (not all idle state 0)
> > >
> > > This is a bit of a big hammer, but since the polling loop already uses
> > > full CPU power anyway, this could be harmless, and solve your problem.
> > 
> > Yes, it would be good to test this.
> > 
> > No, I don't like it.
> > 
> > If we are to break out of the polling loop after a specific time, we
> > shouldn't be using polling at all in the first place.
> > 
> > So I very much prefer to compare data->next_timer_us with the target
> > residency of C1 instead of doing this.
> 
> IOW, something like the below (on top of the $subject patch).
> 
> ---
>  drivers/cpuidle/governors/menu.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-pm/drivers/cpuidle/governors/menu.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/governors/menu.c
> +++ linux-pm/drivers/cpuidle/governors/menu.c
> @@ -321,7 +321,7 @@ static int menu_select(struct cpuidle_dr
>  		 * unless the timer is happening really really soon, or
>  		 * C1's exit latency exceeds the user configured limit.
>  		 */
> -		if (expected_interval > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency &&
> +		if (data->next_timer_us > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency &&
>  		    latency_req > drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency &&
>  		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
>  		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)
> 

Actually, that may be made somewhat less aggressive:

---
 drivers/cpuidle/governors/menu.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -315,13 +315,18 @@ static int menu_select(struct cpuidle_dr
 	expected_interval = min(expected_interval, data->next_timer_us);
 
 	if (CPUIDLE_DRIVER_STATE_START > 0) {
+		unsigned int next_timer_limit_us = drv->states[CPUIDLE_DRIVER_STATE_START].target_residency;
+
 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
 		/*
 		 * We want to default to C1 (hlt), not to busy polling
 		 * unless the timer is happening really really soon, or
 		 * C1's exit latency exceeds the user configured limit.
 		 */
-		if (expected_interval > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency &&
+		if (next_timer_limit_us < 20)
+			next_timer_limit_us = 20;
+
+		if (data->next_timer_us > next_timer_limit_us &&
 		    latency_req > drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency &&
 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
 		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)

except that next_timer_limit_us only needs to be computed once for the given
driver, because it won't change.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 21:35                                                                             ` Rik van Riel
  2016-03-18 21:48                                                                               ` Rafael J. Wysocki
@ 2016-03-18 22:56                                                                               ` Rafael J. Wysocki
  2016-03-19  1:53                                                                                 ` Rik van Riel
  1 sibling, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-18 22:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Doug Smythies, Rafael J. Wysocki, Rafael J. Wysocki,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com> wrote:
> On Fri, 18 Mar 2016 11:32:28 -0700
> "Doug Smythies" <dsmythies@telus.net> wrote:
>
>> Energy:
>> Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
>>
>> Trace data (very crude summary):
>> Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load (idle state 0)
>> Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less, CPU load (not all idle state 0)
>
> This is a bit of a big hammer, but since the polling loop already uses
> full CPU power anyway, this could be harmless, and solve your problem.
>
> ---8<---
>
> Subject: cpuidle: break out of idle polling loop after HLT threshold
>
> Staying in the poll loop can be beneficial for workloads that
> wake back up very, very quickly, but staying in the poll loop
> for too long can waste a lot of power.
>
> Break out of the idle polling loop if the system has spent more
> time in the loop than the threshold for entering the next power
> saving mode.
>
> The poll loop uses full CPU power already, so this will not
> increase the power use of the poll loop.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  drivers/cpuidle/driver.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
> index 389ade4572be..a5dfeafd243f 100644
> --- a/drivers/cpuidle/driver.c
> +++ b/drivers/cpuidle/driver.c
> @@ -181,10 +181,29 @@ static void __cpuidle_driver_init(struct cpuidle_driver *drv)
>  static int poll_idle(struct cpuidle_device *dev,
>                 struct cpuidle_driver *drv, int index)
>  {
> +       /*
> +        * Polling is state 0; break out of the polling loop after the
> +        * threshold for the next power state has passed. Doing this several
> +        * times in a row should cause the menu governor to immediately
> +        * select a deeper power state.
> +        */
> +       u64 limit = drv->states[1].target_residency * NSEC_PER_USEC;
> +       ktime_t start = ktime_get();
> +       int i = 0;
> +
>         local_irq_enable();
>         if (!current_set_polling_and_test()) {
> -               while (!need_resched())
> +               while (!need_resched()) {
> +                       ktime_t now;
>                         cpu_relax();
> +
> +                       /* Occasionally check for a timeout. */
> +                       if (!(i % 1000)) {
> +                               now = ktime_get();
> +                               if (ktime_to_ns(ktime_sub(now, start)) > limit)
> +                                       break;
> +                       }
> +               }
>         }
>         current_clr_polling();

I'm not sure how this is going to help, but maybe I'm missing something.

Say we hit the timeout and break out of the loop.  We get back to
cpuidle_enter_state() and from there (via a couple of irrelevant
steps) to cpuidle_idle_call() and to cpu_idle_loop().  Then, since we
are still idle, the while (1) loop continues and we land in
menu_select() again.  That will choose polling again (because now
there's less time to the next timer than it was last time, so if it
didn't choose C1 then, it won't do that now either) and we land in
poll_idle() again.

So is there anything I'm overlooking?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 22:28                                                                                     ` Rik van Riel
@ 2016-03-18 23:29                                                                                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-18 23:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Doug Smythies,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Fri, Mar 18, 2016 at 11:28 PM, Rik van Riel <riel@redhat.com> wrote:
> On Fri, 2016-03-18 at 23:09 +0100, Rafael J. Wysocki wrote:
>> On Friday, March 18, 2016 05:52:49 PM Rik van Riel wrote:
>> >
>> > On Fri, 2016-03-18 at 22:48 +0100, Rafael J. Wysocki wrote:
>> > >
>> > > On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com>
>> > > wrote:
>> > > >
>> > > >
>> > > > On Fri, 18 Mar 2016 11:32:28 -0700
>> > > > "Doug Smythies" <dsmythies@telus.net> wrote:
>> > > >
>> > > > >
>> > > > >
>> > > > > Energy:
>> > > > > Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
>> > > > >
>> > > > > Trace data (very crude summary):
>> > > > > Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU
>> > > > > load
>> > > > > (idle state 0)
>> > > > > Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but
>> > > > > less,
>> > > > > CPU load (not all idle state 0)
>> > > > This is a bit of a big hammer, but since the polling loop
>> > > > already
>> > > > uses
>> > > > full CPU power anyway, this could be harmless, and solve your
>> > > > problem.
>> > > Yes, it would be good to test this.
>> > >
>> > > No, I don't like it.
>> > >
>> > > If we are to break out of the polling loop after a specific time,
>> > > we
>> > > shouldn't be using polling at all in the first place.
>> > >
>> > > So I very much prefer to compare data->next_timer_us with the
>> > > target
>> > > residency of C1 instead of doing this.
>> > The problem is that data->next_timer_us does not tell us
>> > when the next network packet is about to arrive, though.
>> >
>> > This could cause issues when dealing with a fast stream
>> > of network traffic, which get_typical_interval() would
>> > be able to deal with.
>> If it predicted things correctly for small intervals, that is.
>>
>> It doesn't seem to be able to do that, however.
>
> It appears to get it right the vast majority of the
>> time; it is just that when it gets it wrong in a tiny
> minority of the cases, the penalty for polling is
> really high.

The overall penalty is what matters, though (and it is cumulative,
unfortunately).

>> You seem to want to trade energy consumption for performance with a
>> very narrow
>> use case in mind and that by changing behavior that's been there
>> pretty much
>> forever, so by now it really has become the expected one.
>>
>> Needless to say I'm not a fan of changes like that.
>
> I am not sure it is a narrow use case, and I suspect
> cases with over half a million wakeups a second are
> not going to be all that unusual.

Well, the definition of "unusual" is somewhat elusive. :-)

> The same thing applies not just to network traffic,
> but also to tasks ping-ponging info between CPUs,
> e.g. because some tasks are pinned, or because we
> are dealing with a controller thread and a larger
> number of workers.

All true, but still a 10% energy consumption increase is rather a big deal.

I'd like to hear whether going to a nanosecond time scale makes any
difference in the first place at this point.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 19:29                                                                             ` Rik van Riel
  2016-03-18 20:59                                                                               ` Doug Smythies
  2016-03-18 21:24                                                                               ` Rafael J. Wysocki
@ 2016-03-18 23:40                                                                               ` Rafael J. Wysocki
  2 siblings, 0 replies; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-18 23:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Doug Smythies, Rafael J. Wysocki, Rafael J. Wysocki,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Fri, Mar 18, 2016 at 8:29 PM, Rik van Riel <riel@redhat.com> wrote:
> On Fri, 18 Mar 2016 11:32:28 -0700
> "Doug Smythies" <dsmythies@telus.net> wrote:
>> On 2016.03.18 06:12 Rafael J. Wysocki wrote:
>
>> > I'm wondering what happens if you replace the expected_interval in the
>> > "expected_interval >
>> > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency" test with
>> > data->next_timer_us (with the Rik's patch applied, of course).  Can
>> > you please try doing that?
>>
>> O.K. my reference: rvr6 is the above modification to rvr5
>> It works as well as "reverted".
>>
>> State k45rc7-rjw10-rvr6 (mins)
>> 0.00  0.87
>> 1.00  24.20
>> 2.00  4.05
>> 3.00  1.72
>> 4.00  147.50
>>
>> total 178.34
>>
>> Energy:
>> Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
>>
>> Trace data (very crude summary):
>> Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load (idle state 0)
>> Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less, CPU load (not all idle state 0)
>
> What does "long duration" mean?
> Dozens of microseconds?
> Hundreds of microseconds?
> Milliseconds?
>
> Either way, it appears there is something wrong with the
> code in get_typical_interval.  One of the problems is
> that calculating in microseconds, when working with a
> threshold of 1-2 microseconds is not going to work well,
> and secondly the code declares success the moment the
> standard deviation is below 20 microseconds, which is
> also not the best idea when dealing with 1-2 microsecond
> thresholds :)
>
> Does the below patch help?
>
> If it does, we will probably want to change the values in
> the driver tables to nanoseconds too, so we can get rid of
> the multiplications :)
>
> ---8<---
>
> Subject: cpuidle: track time in nsecs not usecs
>
> The cpuidle code currently tracks time in microseconds, which
> has a few issues. For one, the kernel natively tracks time in
> nanoseconds, and we do a divide by 1000 when tracking how long
> the CPU spent in an idle state.
>
> Secondly, get_typical_interval in the menu governor cannot do
> very useful math when working with small numbers, while the
> HLT cutoff is only 1 or 2 microseconds on many CPUs.
>
> Converting to nanoseconds avoids the expensive divide, and
> also allows get_typical_interval to do calculations that
> may make more sense for smaller intervals.
>
> While we're there, also remove the code that allows
> get_typical_interval to return a value as soon as the
> standard deviation is under 20 microseconds.  That
> threshold makes little sense when C1 and C2 both
> have latencies much lower than that.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  drivers/cpuidle/cpuidle.c          | 10 ++---
>  drivers/cpuidle/governors/ladder.c |  2 +-
>  drivers/cpuidle/governors/menu.c   | 90 +++++++++++++++++---------------------
>  drivers/cpuidle/sysfs.c            |  8 +++-
>  include/linux/cpuidle.h            |  4 +-
>  5 files changed, 56 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index f996efc56605..8a0673880fbe 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -217,18 +217,18 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>         if (!cpuidle_state_is_coupled(drv, entered_state))
>                 local_irq_enable();
>
> -       diff = ktime_to_us(ktime_sub(time_end, time_start));
> -       if (diff > INT_MAX)
> -               diff = INT_MAX;
> +       diff = ktime_to_ns(ktime_sub(time_end, time_start));
> +       if (diff > LONG_MAX)
> +               diff = LONG_MAX;
>
> -       dev->last_residency = (int) diff;
> +       dev->last_residency = diff;
>
>         if (entered_state >= 0) {
>                 /* Update cpuidle counters */
>                 /* This can be moved to within driver enter routine
>                  * but that results in multiple copies of same code.
>                  */
> -               dev->states_usage[entered_state].time += dev->last_residency;
> +               dev->states_usage[entered_state].time += diff;
>                 dev->states_usage[entered_state].usage++;
>         } else {
>                 dev->last_residency = 0;
> diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
> index 63bd5a403e22..6b480a3a3a2e 100644
> --- a/drivers/cpuidle/governors/ladder.c
> +++ b/drivers/cpuidle/governors/ladder.c
> @@ -80,7 +80,7 @@ static int ladder_select_state(struct cpuidle_driver *drv,
>
>         last_state = &ldev->states[last_idx];
>
> -       last_residency = cpuidle_get_last_residency(dev) - drv->states[last_idx].exit_latency;
> +       last_residency = ktime_to_us(cpuidle_get_last_residency(dev)) - drv->states[last_idx].exit_latency;
>
>         /* consider promotion */
>         if (last_idx < drv->state_count - 1 &&
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index fba867d917f7..69adb6ab7b8c 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -33,9 +33,9 @@
>  #define BUCKETS 12
>  #define INTERVAL_SHIFT 3
>  #define INTERVALS (1UL << INTERVAL_SHIFT)
> -#define RESOLUTION 1024
> +#define RESOLUTION NSEC_PER_USEC
>  #define DECAY 8
> -#define MAX_INTERESTING 50000
> +#define MAX_INTERESTING 4000000
>
>
>  /*
> @@ -122,11 +122,11 @@ struct menu_device {
>         int             last_state_idx;
>         int             needs_update;
>
> -       unsigned int    next_timer_us;
> -       unsigned int    predicted_us;
> +       u64             next_timer_ns;
> +       u64             predicted_ns;
>         unsigned int    bucket;
>         unsigned int    correction_factor[BUCKETS];
> -       unsigned int    intervals[INTERVALS];
> +       u64             intervals[INTERVALS];
>         int             interval_ptr;
>  };
>
> @@ -152,15 +152,15 @@ static inline int which_bucket(unsigned int duration, unsigned long nr_iowaiters
>         if (nr_iowaiters)
>                 bucket = BUCKETS/2;
>
> -       if (duration < 10)
> +       if (duration < 10000)
>                 return bucket;
> -       if (duration < 100)
> +       if (duration < 100000)
>                 return bucket + 1;
> -       if (duration < 1000)
> +       if (duration < 1000000)
>                 return bucket + 2;
> -       if (duration < 10000)
> +       if (duration < 10000000)
>                 return bucket + 3;
> -       if (duration < 100000)
> +       if (duration < 100000000)
>                 return bucket + 4;
>         return bucket + 5;
>  }
> @@ -242,21 +242,12 @@ static unsigned int get_typical_interval(struct menu_device *data)
>          * The typical interval is obtained when standard deviation is small
>          * or standard deviation is small compared to the average interval.
>          *
> -        * int_sqrt() formal parameter type is unsigned long. When the
> -        * greatest difference to an outlier exceeds ~65 ms * sqrt(divisor)
> -        * the resulting squared standard deviation exceeds the input domain
> -        * of int_sqrt on platforms where unsigned long is 32 bits in size.
> -        * In such case reject the candidate average.
> -        *
> -        * Use this result only if there is no timer to wake us up sooner.
> +        * Instead of taking the square root of the standard deviation,
> +        * use the square of the average. Be careful to avoid overflow.
>          */
> -       if (likely(stddev <= ULONG_MAX)) {
> -               stddev = int_sqrt(stddev);
> -               if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
> -                                                       || stddev <= 20) {
> -                       return avg;
> -               }
> -       }
> +       if (avg < INT_MAX && (avg * avg) > stddev * 9 &&
> +                                       divisor * 4 >= INTERVALS * 3)
> +               return avg;
>
>         /*
>          * If we have outliers to the upside in our distribution, discard

This has to be rebased on top of the current Linus' tree.  It already
uses variance here.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-18 22:56                                                                               ` Rafael J. Wysocki
@ 2016-03-19  1:53                                                                                 ` Rik van Riel
  2016-03-19  2:06                                                                                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-19  1:53 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, Rafael J. Wysocki, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 4969 bytes --]

On Fri, 2016-03-18 at 23:56 +0100, Rafael J. Wysocki wrote:
> On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com>
> wrote:
> > 
> > On Fri, 18 Mar 2016 11:32:28 -0700
> > "Doug Smythies" <dsmythies@telus.net> wrote:
> > 
> > > 
> > > Energy:
> > > Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
> > > 
> > > Trace data (very crude summary):
> > > Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load
> > > (idle state 0)
> > > Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less,
> > > CPU load (not all idle state 0)
> > This is a bit of a big hammer, but since the polling loop already
> > uses
> > full CPU power anyway, this could be harmless, and solve your
> > problem.
> > 
> > ---8<---
> > 
> > Subject: cpuidle: break out of idle polling loop after HLT
> > threshold
> > 
> > Staying in the poll loop can be beneficial for workloads that
> > wake back up very, very quickly, but staying in the poll loop
> > for too long can waste a lot of power.
> > 
> > Break out of the idle polling loop if the system has spent more
> > time in the loop than the threshold for entering the next power
> > saving mode.
> > 
> > The poll loop uses full CPU power already, so this will not
> > increase the power use of the poll loop.
> > 
> > Signed-off-by: Rik van Riel <riel@redhat.com>
> > ---
> >  drivers/cpuidle/driver.c | 21 ++++++++++++++++++++-
> >  1 file changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
> > index 389ade4572be..a5dfeafd243f 100644
> > --- a/drivers/cpuidle/driver.c
> > +++ b/drivers/cpuidle/driver.c
> > @@ -181,10 +181,29 @@ static void __cpuidle_driver_init(struct
> > cpuidle_driver *drv)
> >  static int poll_idle(struct cpuidle_device *dev,
> >                 struct cpuidle_driver *drv, int index)
> >  {
> > +       /*
> > +        * Polling is state 0; break out of the polling loop after
> > the
> > +        * threshold for the next power state has passed. Doing
> > this several
> > +        * times in a row should cause the menu governor to
> > immediately
> > +        * select a deeper power state.
> > +        */
> > +       u64 limit = drv->states[1].target_residency *
> > NSEC_PER_USEC;
> > +       ktime_t start = ktime_get();
> > +       int i = 0;
> > +
> >         local_irq_enable();
> >         if (!current_set_polling_and_test()) {
> > -               while (!need_resched())
> > +               while (!need_resched()) {
> > +                       ktime_t now;
> >                         cpu_relax();
> > +
> > +                       /* Occasionally check for a timeout. */
> > +                       if (!(i % 1000)) {
> > +                               now = ktime_get();
> > +                               if (ktime_to_ns(ktime_sub(now,
> > start)) > limit)
> > +                                       break;
> > +                       }
> > +               }
> >         }
> >         current_clr_polling();
> I'm not sure how this is going to help, but maybe I'm missing
> something.
> 
> Say we hit the timeout and break out of the loop.  We get back to
> cpuidle_enter_state() and from there (via a couple of irrelevant
> steps) to cpuidle_idle_call() and to cpu_idle_loop().  Then, since we
> are still idle, the while (1) loop continues and we land in
> menu_select() again.  That will choose polling again (because now
> there's less time to the next timer than it was last time, so if it
> didn't choose C1 then, it won't do that now either) and we land in
> poll_idle() again.
> 
> So is there anything I'm overlooking?

After about three microseconds of the above, get_typical_interval()
will return an interval larger than the C1 target residency, or
fail to find a typical interval, because 3 of the 8 intervals will
exceed the C1 target residency at that point.

Only repeated very fast wakeups will lead to polling, and just
a few periods of polling timing out will lead to the system
reverting to HLT.

In other words, the system will only use polling while it appears
to be beneficial, and should snap out of that mode very quickly.

The kernel test robot also showed over 6% improvement in (IIRC)
the wikibench test, with polling being done a little more often.

Other parts of the kernel do pretty much anything for 2% extra
performance, so if we can find a way to preserve that 6%, without
hurting power usage significantly, I suspect it may be worthwhile.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-19  1:53                                                                                 ` Rik van Riel
@ 2016-03-19  2:06                                                                                   ` Rafael J. Wysocki
  2016-03-19  2:17                                                                                     ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-19  2:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Doug Smythies, Rafael J. Wysocki,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Sat, Mar 19, 2016 at 2:53 AM, Rik van Riel <riel@redhat.com> wrote:
> On Fri, 2016-03-18 at 23:56 +0100, Rafael J. Wysocki wrote:
>> On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com>
>> wrote:
>> >
>> > On Fri, 18 Mar 2016 11:32:28 -0700
>> > "Doug Smythies" <dsmythies@telus.net> wrote:
>> >
>> > >
>> > > Energy:
>> > > Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
>> > >
>> > > Trace data (very crude summary):
>> > > Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load
>> > > (idle state 0)
>> > > Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less,
>> > > CPU load (not all idle state 0)
>> > This is a bit of a big hammer, but since the polling loop already
>> > uses
>> > full CPU power anyway, this could be harmless, and solve your
>> > problem.
>> >
>> > ---8<---
>> >
>> > Subject: cpuidle: break out of idle polling loop after HLT
>> > threshold
>> >
>> > Staying in the poll loop can be beneficial for workloads that
>> > wake back up very, very quickly, but staying in the poll loop
>> > for too long can waste a lot of power.
>> >
>> > Break out of the idle polling loop if the system has spent more
>> > time in the loop than the threshold for entering the next power
>> > saving mode.
>> >
>> > The poll loop uses full CPU power already, so this will not
>> > increase the power use of the poll loop.
>> >
>> > Signed-off-by: Rik van Riel <riel@redhat.com>
>> > ---
>> >  drivers/cpuidle/driver.c | 21 ++++++++++++++++++++-
>> >  1 file changed, 20 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
>> > index 389ade4572be..a5dfeafd243f 100644
>> > --- a/drivers/cpuidle/driver.c
>> > +++ b/drivers/cpuidle/driver.c
>> > @@ -181,10 +181,29 @@ static void __cpuidle_driver_init(struct
>> > cpuidle_driver *drv)
>> >  static int poll_idle(struct cpuidle_device *dev,
>> >                 struct cpuidle_driver *drv, int index)
>> >  {
>> > +       /*
>> > +        * Polling is state 0; break out of the polling loop after
>> > the
>> > +        * threshold for the next power state has passed. Doing
>> > this several
>> > +        * times in a row should cause the menu governor to
>> > immediately
>> > +        * select a deeper power state.
>> > +        */
>> > +       u64 limit = drv->states[1].target_residency *
>> > NSEC_PER_USEC;
>> > +       ktime_t start = ktime_get();
>> > +       int i = 0;
>> > +
>> >         local_irq_enable();
>> >         if (!current_set_polling_and_test()) {
>> > -               while (!need_resched())
>> > +               while (!need_resched()) {
>> > +                       ktime_t now;
>> >                         cpu_relax();
>> > +
>> > +                       /* Occasionally check for a timeout. */
>> > +                       if (!(i % 1000)) {
>> > +                               now = ktime_get();
>> > +                               if (ktime_to_ns(ktime_sub(now,
>> > start)) > limit)
>> > +                                       break;
>> > +                       }
>> > +               }
>> >         }
>> >         current_clr_polling();
>> I'm not sure how this is going to help, but maybe I'm missing
>> something.
>>
>> Say we hit the timeout and break out of the loop.  We get back to
>> cpuidle_enter_state() and from there (via a couple of irrelevant
>> steps) to cpuidle_idle_call() and to cpu_idle_loop().  Then, since we
>> are still idle, the while (1) loop continues and we land in
>> menu_select() again.  That will choose polling again (because now
>> there's less time to the next timer than it was last time, so if it
>> didn't choose C1 then, it won't do that now either) and we land in
>> poll_idle() again.
>>
>> So is there anything I'm overlooking?
>
> After about three microseconds of the above, get_typical_interval()
> will return an interval larger than the C1 target residency, or
> fail to find a typical interval, because 3 of the 8 intervals will
> exceed the C1 target residency at that point.
>
> Only repeated very fast wakeups will lead to polling, and just
> a few periods of polling timing out will lead to the system
> reverting to HLT.
>
> In other words, the system will only use polling while it appears
> to be beneficial, and should snap out of that mode very quickly.

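The snap-out behaviour described in the quoted paragraphs can be illustrated with a simplified userspace model of the menu governor's typical-interval detection.  This is a sketch only, not the kernel's actual get_typical_interval() (which discards outliers iteratively); the function name, the single-pass variance check, and the stddev-below-a-quarter-of-the-mean threshold are all assumptions made for illustration:

```c
#include <stdint.h>

#define INTERVALS 8

/*
 * Simplified model of typical-interval detection: take the mean of the
 * last 8 observed idle intervals and reject the whole set when the
 * standard deviation is too large relative to the mean.  Returns the
 * typical interval in usec, or UINT32_MAX when no stable pattern exists.
 */
static uint32_t typical_interval_us(const uint32_t intervals[INTERVALS])
{
	uint64_t sum = 0, sq = 0;
	uint64_t avg, var;
	int i;

	for (i = 0; i < INTERVALS; i++)
		sum += intervals[i];
	avg = sum / INTERVALS;

	for (i = 0; i < INTERVALS; i++) {
		int64_t d = (int64_t)intervals[i] - (int64_t)avg;
		sq += (uint64_t)(d * d);
	}
	var = sq / INTERVALS;

	/* Accept only if stddev is below a quarter of the mean (or tiny). */
	if (var <= (avg * avg) / 16 || var <= 400)
		return (uint32_t)avg;
	return UINT32_MAX;
}
```

With this model, a few timed-out polls (long intervals mixed in with the short ones) blow up the variance, so the detector stops reporting a short typical interval and the governor stops predicting the ultra-short sleeps that favor polling.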
OK, thanks!

> The kernel test robot also showed over 6% improvement in (IIRC)
> the wikibench test, with polling being done a little more often.
>
> Other parts of the kernel do pretty much anything for 2% extra
> performance, so if we can find a way to preserve that 6%, without
> hurting power usage significantly, I suspect it may be worthwhile.

Well. OK.  That is a worthy goal, but things in this thread look a bit
too ad-hoc to me to be honest.

I'd like to take one step at a time and in such a way that
improvements are actually confirmed in every step.

Does this sound reasonable?

If so, I first would like to restore the energy consumption in
Doug's case at least to the level from before commit a9ceb78bc75c
(cpuidle,menu: use interactivity_req to disable polling).  That's what
I'd like to do for 4.6 and possibly before the merge window is over.

After 4.6-rc1, in the 4.7 cycle, we can think about improving
performance here and maybe sacrificing some energy for that, but in a
controlled way as I said above.

What do you think?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-19  2:06                                                                                   ` Rafael J. Wysocki
@ 2016-03-19  2:17                                                                                     ` Rik van Riel
  2016-03-19  2:33                                                                                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2016-03-19  2:17 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, Rafael J. Wysocki, Viresh Kumar,
	Srinivas Pandruvada, Chen, Yu C, linux-pm, Arto Jantunen,
	Len Brown

[-- Attachment #1: Type: text/plain, Size: 1488 bytes --]

On Sat, 2016-03-19 at 03:06 +0100, Rafael J. Wysocki wrote:
> On Sat, Mar 19, 2016 at 2:53 AM, Rik van Riel <riel@redhat.com>
> wrote:
> > 
> > The kernel test robot also showed over 6% improvement in (IIRC)
> > the wikibench test, with polling being done a little more often.
> > 
> > Other parts of the kernel do pretty much anything for 2% extra
> > performance, so if we can find a way to preserve that 6%, without
> > hurting power usage significantly, I suspect it may be worthwhile.
> Well. OK.  That is a worthy goal, but things in this thread look a
> bit
> too ad-hoc to me to be honest.
> 
> I'd like to take one step at a time and in such a way that
> improvements are actually confirmed in every step.
> 
> Does this sound reasonable?
> 
> If so, I first would like to restore the energy consumption in
> Doug's case at least to the level from before commit a9ceb78bc75c
> (cpuidle,menu: use interactivity_req to disable polling).  That's
> what
> I'd like to do for 4.6 and possibly before the merge window is over.
> 
> After 4.6-rc1, in the 4.7 cycle, we can think about improving
> performance here and maybe sacrificing some energy for that, but in a
> controlled way as I said above.
> 
> What do you think?

That sounds perfectly reasonable to me.

Count me in for any efforts to improve performance with
the cpuidle code, and hopefully allow more people to enable
c-states on their servers.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] cpuidle: use high confidence factors only when considering polling
  2016-03-19  2:17                                                                                     ` Rik van Riel
@ 2016-03-19  2:33                                                                                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 59+ messages in thread
From: Rafael J. Wysocki @ 2016-03-19  2:33 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Doug Smythies, Rafael J. Wysocki,
	Viresh Kumar, Srinivas Pandruvada, Chen, Yu C, linux-pm,
	Arto Jantunen, Len Brown

On Sat, Mar 19, 2016 at 3:17 AM, Rik van Riel <riel@redhat.com> wrote:
> On Sat, 2016-03-19 at 03:06 +0100, Rafael J. Wysocki wrote:
>> On Sat, Mar 19, 2016 at 2:53 AM, Rik van Riel <riel@redhat.com>
>> wrote:
>> >
>> > The kernel test robot also showed over 6% improvement in (IIRC)
>> > the wikibench test, with polling being done a little more often.
>> >
>> > Other parts of the kernel do pretty much anything for 2% extra
>> > performance, so if we can find a way to preserve that 6%, without
>> > hurting power usage significantly, I suspect it may be worthwhile.
>> Well. OK.  That is a worthy goal, but things in this thread look a
>> bit
>> too ad-hoc to me to be honest.
>>
>> I'd like to take one step at a time and in such a way that
>> improvements are actually confirmed in every step.
>>
>> Does this sound reasonable?
>>
>> If so, I first would like to restore the energy consumption in
>> Doug's case at least to the level from before commit a9ceb78bc75c
>> (cpuidle,menu: use interactivity_req to disable polling).  That's
>> what
>> I'd like to do for 4.6 and possibly before the merge window is over.
>>
>> After 4.6-rc1, in the 4.7 cycle, we can think about improving
>> performance here and maybe sacrificing some energy for that, but in a
>> controlled way as I said above.
>>
>> What do you think?
>
> That sounds perfectly reasonable to me.
>
> Count me in for any efforts to improve performance with
> the cpuidle code, and hopefully allow more people to enable
> c-states on their servers.

OK, great!

I'm going to send a patch for Doug to test that I'd like to apply for
4.6 (on top of http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=e132b9b3bc7f19e9b158e42b323881d5dee5ecf3),
maybe not today (because I'm falling asleep now), but likely tomorrow
or on Sunday.  If that restores things for him, I'll apply it and that
will be the starting point for future changes.

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2016-03-19  2:33 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-01 17:51 SKL BOOT FAILURE unless idle=nomwait (was Re: PROBLEM: Cpufreq constantly keeps frequency at maximum on 4.5-rc4) Len Brown
     [not found] ` <87si087tsr.fsf@iki.fi>
2016-03-02 17:10   ` Rik van Riel
2016-03-08 21:13     ` Len Brown
2016-03-08 21:19   ` Len Brown
2016-03-09 17:01     ` Arto Jantunen
2016-03-09 23:03       ` Doug Smythies
2016-03-09 23:18         ` Rafael J. Wysocki
2016-03-09 23:45           ` Doug Smythies
2016-03-09 23:59             ` Rafael J. Wysocki
2016-03-11 14:03               ` Rik van Riel
2016-03-11 18:22                 ` Doug Smythies
2016-03-11 20:30                   ` Rik van Riel
2016-03-11 23:54                     ` Rafael J. Wysocki
2016-03-12  0:46                       ` Doug Smythies
2016-03-12  1:45                         ` Rafael J. Wysocki
2016-03-12  2:02                           ` Rafael J. Wysocki
2016-03-13  7:46                             ` Doug Smythies
2016-03-14  1:32                               ` Rafael J. Wysocki
2016-03-14  6:39                                 ` Doug Smythies
2016-03-14 12:47                                   ` Rafael J. Wysocki
2016-03-14 14:31                                     ` Rik van Riel
2016-03-14 15:21                                       ` Rafael J. Wysocki
2016-03-14 17:45                                         ` Rik van Riel
2016-03-14 22:55                                           ` Rafael J. Wysocki
2016-03-15  2:03                                             ` Rik van Riel
2016-03-16  0:32                                               ` Rafael J. Wysocki
2016-03-16  0:37                                                 ` Rafael J. Wysocki
2016-03-16  0:55                                                 ` Rik van Riel
2016-03-16  1:53                                                   ` Rafael J. Wysocki
2016-03-16 13:02                                                     ` Rafael J. Wysocki
2016-03-16 14:01                                                       ` Rik van Riel
2016-03-16 14:14                                                         ` Rafael J. Wysocki
2016-03-16 14:46                                                           ` Rik van Riel
2016-03-16 15:05                                                             ` Rafael J. Wysocki
2016-03-16 15:07                                                               ` Rik van Riel
2016-03-16 15:09                                                                 ` Rafael J. Wysocki
2016-03-16 16:14                                                                   ` [PATCH] cpuidle: use high confidence factors only when considering polling Rik van Riel
2016-03-18  0:45                                                                     ` Rafael J. Wysocki
2016-03-18  6:32                                                                       ` Doug Smythies
2016-03-18 13:11                                                                         ` Rafael J. Wysocki
2016-03-18 18:32                                                                           ` Doug Smythies
2016-03-18 19:29                                                                             ` Rik van Riel
2016-03-18 20:59                                                                               ` Doug Smythies
2016-03-18 21:24                                                                               ` Rafael J. Wysocki
2016-03-18 21:26                                                                                 ` Rik van Riel
2016-03-18 23:40                                                                               ` Rafael J. Wysocki
2016-03-18 21:35                                                                             ` Rik van Riel
2016-03-18 21:48                                                                               ` Rafael J. Wysocki
2016-03-18 21:52                                                                                 ` Rik van Riel
2016-03-18 22:09                                                                                   ` Rafael J. Wysocki
2016-03-18 22:28                                                                                     ` Rik van Riel
2016-03-18 23:29                                                                                       ` Rafael J. Wysocki
2016-03-18 21:56                                                                                 ` Rafael J. Wysocki
2016-03-18 22:38                                                                                   ` Rafael J. Wysocki
2016-03-18 22:56                                                                               ` Rafael J. Wysocki
2016-03-19  1:53                                                                                 ` Rik van Riel
2016-03-19  2:06                                                                                   ` Rafael J. Wysocki
2016-03-19  2:17                                                                                     ` Rik van Riel
2016-03-19  2:33                                                                                       ` Rafael J. Wysocki
