linux-kernel.vger.kernel.org archive mirror
* [PATCH v3] cpuidle: poll_state: Add time limit to poll_idle()
@ 2018-03-14 14:08 Rafael J. Wysocki
  2018-03-22 16:32 ` Rik van Riel
  0 siblings, 1 reply; 22+ messages in thread
From: Rafael J. Wysocki @ 2018-03-14 14:08 UTC (permalink / raw)
  To: Linux PM
  Cc: Peter Zijlstra, Frederic Weisbecker, Thomas Gleixner,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

If poll_idle() is allowed to spin until need_resched() returns 'true',
it may actually spin for a much longer time than expected by the idle
governor, since set_tsk_need_resched() is not always called by the
timer interrupt handler.  If that happens, the CPU may spend much
more time than anticipated in the "polling" state.

To prevent that from happening, limit the time of the spinning loop
in poll_idle().

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v2 -> v3: Use local_clock() for time measurements and drop the
          counter, since that should be lightweight enough (as
          suggested by Peter).

---
 drivers/cpuidle/poll_state.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Index: linux-pm/drivers/cpuidle/poll_state.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/poll_state.c
+++ linux-pm/drivers/cpuidle/poll_state.c
@@ -6,15 +6,24 @@
 
 #include <linux/cpuidle.h>
 #include <linux/sched.h>
+#include <linux/sched/clock.h>
 #include <linux/sched/idle.h>
 
+#define POLL_IDLE_TIME_LIMIT	(TICK_NSEC / 16)
+
 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
+	u64 time_start = local_clock();
+
 	local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		while (!need_resched())
+		while (!need_resched()) {
 			cpu_relax();
+
+			if (local_clock() - time_start > POLL_IDLE_TIME_LIMIT)
+				break;
+		}
 	}
 	current_clr_polling();
 

* RE: [PATCH v3] cpuidle: poll_state: Add time limit to poll_idle()
@ 2018-03-14 15:00 Doug Smythies
  2018-03-20 10:52 ` Rafael J. Wysocki
  0 siblings, 1 reply; 22+ messages in thread
From: Doug Smythies @ 2018-03-14 15:00 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Peter Zijlstra', 'Frederic Weisbecker',
	'Thomas Gleixner', 'Paul McKenney',
	'Thomas Ilsche', 'Rik van Riel',
	'Aubrey Li', 'Mike Galbraith', 'LKML',
	Doug Smythies, 'Linux PM'

On 2018.03.14 07:09 Rafael J. Wysocki wrote:

... [snip]...

> v2 -> v3: Use local_clock() for time measurements and drop the
>          counter, since that should be lightweight enough (as
>          suggested by Peter).

I have been testing the latest of everything for a couple of days
now, and everything continues to be great.

Note that I was using a POLL_IDLE_TIME_CHECK_COUNT of 1 anyhow, because
I specifically wanted to test the worst case time through the loop.
i.e. I wanted any potential issue to be 1000 times more likely to be found.
My problem is that I don't know of a good test for this specifically.

I'll switch to this V3, along with V4 of the "sched/cpuidle: Idle loop
rework" 7 patch set.

As for energy savings for just this patch only, I would refer readers
to my previous test results from late November, [1], as I haven't
re-done those Phoronix tests yet, but I don't expect the results to
differ much.

[1] https://marc.info/?l=linux-pm&m=151154499710125&w=2

... Doug

* RE: [PATCH v3] cpuidle: poll_state: Add time limit to poll_idle()
@ 2018-03-25  0:28 Doug Smythies
  2018-03-25 11:53 ` Rafael J. Wysocki
  2018-03-26  6:01 ` Doug Smythies
  0 siblings, 2 replies; 22+ messages in thread
From: Doug Smythies @ 2018-03-25  0:28 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Peter Zijlstra', 'Frederic Weisbecker',
	'Thomas Gleixner', 'Paul McKenney',
	'Thomas Ilsche', 'Rik van Riel',
	'Aubrey Li', 'Mike Galbraith', 'LKML',
	'Linux PM',
	Doug Smythies

On 2018.03.14 07:04 Rafael J. Wysocki wrote:

> If poll_idle() is allowed to spin until need_resched() returns 'true',
> it may actually spin for a much longer time than expected by the idle
> governor, since set_tsk_need_resched() is not always called by the
> timer interrupt handler.  If that happens, the CPU may spend much
> more time than anticipated in the "polling" state.
>
> To prevent that from happening, limit the time of the spinning loop
> in poll_idle().

...[snip]...

> +#define POLL_IDLE_TIME_LIMIT	(TICK_NSEC / 16)

The other ongoing threads on this aside, potentially, there might
be another issue.

What if the next available idle state, after 0, has a residency
that is greater than TICK_NSEC / 16? Meaning these numbers, for example:

/sys/devices/system/cpu/cpu0/cpuidle/state*/residency

The suggestion is that upon a timeout exit from idle state 0,
the measured_us should perhaps be rejected, because it biases
the statistics, and the bias does not seem to correct itself.

Up to 1300% (<- not a typo) extra power consumption has been observed.

Supporting experimental data:

My processor: 
/sys/devices/system/cpu/cpu0/cpuidle/state0/residency:0
/sys/devices/system/cpu/cpu0/cpuidle/state1/residency:2
/sys/devices/system/cpu/cpu0/cpuidle/state2/residency:20
/sys/devices/system/cpu/cpu0/cpuidle/state3/residency:211 <<< Important
/sys/devices/system/cpu/cpu0/cpuidle/state4/residency:345

A 1000 Hz kernel: (TICK_NSEC/16) = 62.5 uSec; idle system:

Idle state 0 time: Typically 0 uSec.
Processor package power: 3.7 watts (steady)

Now, disable idle states 1 and 2:

Idle state 0 time (all 8 CPUs): ~~ 430 Seconds / minute
Processor package power: ~52 watts (1300% more power, 14X)

A 250 Hz kernel: (TICK_NSEC/16) = 250 uSec; idle system:

Idle state 0 time: Typically < 1 mSec / minute
Processor package power: 3.7 to 3.8 watts

Now, disable idle states 1 and 2:

Idle state 0 time (all 8 CPUs): Typically 0 to 70 mSecs / minute
Processor package power: 3.7 to 3.8 watts

A 1000 Hz kernel with:

+#define POLL_IDLE_TIME_LIMIT	(TICK_NSEC / 4)

instead, i.e. (TICK_NSEC/4) = 250 uSec; idle system:

Note: Just for a test. I am not suggesting this should change.

Idle state 0 time: Typically 0 uSec.
Processor package power: 3.7 watts (steady)

Now, disable idle states 1 and 2:

Idle state 0 time (all 8 CPUs): Typically 0 to 70 mSecs / minute
Processor package power: ~3.8 watts

Note 1: My example is contrived via disabling idle states, so
I don't know if it actually needs to be worried about.

Note 2: I do not know if there is some processor where
cpuidle/state1/residency is > 62.5 uSec.

Note 3: I am trying to figure out a way to test rejecting
measured_us upon timeout exit, but haven't made much progress.

... Doug


end of thread, other threads:[~2018-03-27 21:06 UTC | newest]

Thread overview: 22+ messages
2018-03-14 14:08 [PATCH v3] cpuidle: poll_state: Add time limit to poll_idle() Rafael J. Wysocki
2018-03-22 16:32 ` Rik van Riel
2018-03-22 17:09   ` Rafael J. Wysocki
2018-03-22 17:19     ` Rik van Riel
2018-03-22 17:24       ` Rafael J. Wysocki
2018-03-25 20:15     ` Rik van Riel
2018-03-25 21:34       ` Rafael J. Wysocki
2018-03-25 21:45         ` Rik van Riel
2018-03-26  5:59         ` Doug Smythies
2018-03-26  7:13         ` Doug Smythies
2018-03-26  9:35           ` Rafael J. Wysocki
2018-03-26 16:32         ` Rik van Riel
2018-03-26 21:44           ` Rafael J. Wysocki
2018-03-26 21:48             ` Rik van Riel
2018-03-27 17:59     ` Rik van Riel
2018-03-27 21:06       ` Rafael J. Wysocki
2018-03-14 15:00 Doug Smythies
2018-03-20 10:52 ` Rafael J. Wysocki
2018-03-25  0:28 Doug Smythies
2018-03-25 11:53 ` Rafael J. Wysocki
2018-03-25 21:41   ` Rafael J. Wysocki
2018-03-26  6:01 ` Doug Smythies
