* [RFT][PATCH v4 0/7] sched/cpuidle: Idle loop rework
@ 2018-03-12  9:46 Rafael J. Wysocki
  2018-03-12  9:47 ` [RFT][PATCH v4 1/7] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2018-03-12  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM, Frederic Weisbecker
  Cc: Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

Hi All,

Thanks a lot for the feedback so far!

This is mostly a re-send of v3, except for patch 5, which has been updated
to address the review comments from Frederic, and patch 7, which is new in
this series (although I have already posted it in response to the feedback
from Doug).

In my view it has passed the RFC phase, but it still is very much RFT. :-)

The original summary:

On Sunday, March 4, 2018 11:21:30 PM CET Rafael J. Wysocki wrote:
> 
> The problem is that if we stop the sched tick in
> tick_nohz_idle_enter() and then the idle governor predicts short idle
> duration, we lose regardless of whether or not it is right.
> 
> If it is right, we've lost already, because we stopped the tick
> unnecessarily.  If it is not right, we'll lose going forward, because
> the idle state selected by the governor is going to be too shallow and
> we'll draw too much power (that has been reported recently to actually
> happen often enough for people to care).
> 
> This patch series is an attempt to improve the situation and the idea
> here is to make the decision whether or not to stop the tick deeper in
> the idle loop and in particular after running the idle state selection
> in the path where the idle governor is invoked.  This way the problem
> can be avoided, because the idle duration predicted by the idle governor
> can be used to decide whether or not to stop the tick so that the tick
> is only stopped if that value is large enough (and, consequently, the
> idle state selected by the governor is deep enough).
> 
> The series tries to avoid adding too much new code; rather, it reorders
> the existing code and makes it more fine-grained.
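
Roughly, the reordered idle loop is expected to work as in the sketch
below (a simplified view based on patches 4 and 5; the real code also
handles s2idle, RCU and the polling path):

	bool stop_tick = true;

	/* Let the governor select a state and say whether the tick
	 * should be stopped (patch 4). */
	next_state = cpuidle_select(drv, dev, &stop_tick);

	/* Stop the tick only if the prediction warrants it (patch 5). */
	if (stop_tick)
		tick_nohz_idle_stop_tick();
	else
		tick_nohz_idle_retain_tick();

	entered_state = call_cpuidle(drv, dev, next_state);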

Patch 1 prepares the tick-sched code for the subsequent modifications and it
doesn't change the code's functionality (at least not intentionally).
 
Patch 2 starts pushing the tick stopping decision deeper into the idle
loop, but it is limited to do_idle() and tick_nohz_irq_exit().

Patch 3 makes cpuidle_idle_call() decide whether or not to stop the tick
and sets the stage for the subsequent changes.

Patch 4 adds a bool pointer argument to cpuidle_select() and the ->select
governor callback allowing them to return a "nohz" hint to the caller on
whether or not to stop the tick.  It also adds code to the menu governor
to decide what value to return as "nohz".

Patch 5 reorders the idle state selection with respect to the stopping of the
tick and causes the additional "nohz" hint from cpuidle_select() to be used
for deciding whether or not to stop the tick.

Patch 6 causes the menu governor to refine the state selection in case the
tick is not going to be stopped and the already selected state may not fit
before the next tick time.

Patch 7 deals with the situation in which the tick has been stopped
previously, but the idle governor still predicts a short idle duration.

And this still applies:

> I have tested these patches on a couple of machines, including the very laptop
> I'm sending them from, without any obvious issues, but please give them a go
> if you can, especially if you have an easy way to reproduce the problem they
> are targeting.

Also, for better results please test them along with the poll_idle() patch at:

https://patchwork.kernel.org/patch/10275759/

Thanks,
Rafael

* [RFT][PATCH v4 1/7] time: tick-sched: Reorganize idle tick management code
  2018-03-12  9:46 [RFT][PATCH v4 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
@ 2018-03-12  9:47 ` Rafael J. Wysocki
  2018-03-14 15:49   ` Frederic Weisbecker
  2018-03-12  9:51 ` [RFT][PATCH v4 2/7] sched: idle: Do not stop the tick upfront in the idle loop Rafael J. Wysocki
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2018-03-12  9:47 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM, Frederic Weisbecker
  Cc: Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Prepare the scheduler tick code for reworking the idle loop to
avoid stopping the tick in some cases.

Move the tick_nohz_start_idle() invocation out of
__tick_nohz_idle_enter(), rename the latter to
__tick_nohz_idle_stop_tick() and define tick_nohz_idle_stop_tick()
as a wrapper around it for calling it from the outside.

Make tick_nohz_idle_enter() only call tick_nohz_start_idle() instead
of the entire __tick_nohz_idle_enter().  Also add a wrapper that
disables and enables interrupts around tick_nohz_idle_stop_tick() and
make the current callers of tick_nohz_idle_enter() call that wrapper
too, to retain their current functionality.
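
In short, the entry points end up split like this (a condensed view
of the result of this patch, not the diff itself):

	/* Prepare for idle; the tick keeps running for now. */
	void tick_nohz_idle_enter(void)
	{
		...
		ts->inidle = 1;
		tick_nohz_start_idle(ts);
		...
	}

	/* Actually stop the tick, if appropriate. */
	void tick_nohz_idle_stop_tick(void)
	{
		__tick_nohz_idle_stop_tick(this_cpu_ptr(&tick_cpu_sched));
	}

	/* Wrapper for callers running with interrupts enabled. */
	static inline void tick_nohz_idle_stop_tick_protected(void)
	{
		local_irq_disable();
		tick_nohz_idle_stop_tick();
		local_irq_enable();
	}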

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 arch/x86/xen/smp_pv.c    |    1 +
 include/linux/tick.h     |    9 +++++++++
 kernel/sched/idle.c      |    1 +
 kernel/time/tick-sched.c |   46 +++++++++++++++++++++++++---------------------
 4 files changed, 36 insertions(+), 21 deletions(-)

Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -114,6 +114,7 @@ enum tick_dep_bits {
 #ifdef CONFIG_NO_HZ_COMMON
 extern bool tick_nohz_enabled;
 extern int tick_nohz_tick_stopped(void);
+extern void tick_nohz_idle_stop_tick(void);
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
@@ -125,6 +126,7 @@ extern u64 get_cpu_iowait_time_us(int cp
 #else /* !CONFIG_NO_HZ_COMMON */
 #define tick_nohz_enabled (0)
 static inline int tick_nohz_tick_stopped(void) { return 0; }
+static inline void tick_nohz_idle_stop_tick(void) { }
 static inline void tick_nohz_idle_enter(void) { }
 static inline void tick_nohz_idle_exit(void) { }
 
@@ -136,6 +138,13 @@ static inline u64 get_cpu_idle_time_us(i
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 #endif /* !CONFIG_NO_HZ_COMMON */
 
+static inline void tick_nohz_idle_stop_tick_protected(void)
+{
+	local_irq_disable();
+	tick_nohz_idle_stop_tick();
+	local_irq_enable();
+}
+
 #ifdef CONFIG_NO_HZ_FULL
 extern bool tick_nohz_full_running;
 extern cpumask_var_t tick_nohz_full_mask;
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -539,14 +539,11 @@ static void tick_nohz_stop_idle(struct t
 	sched_clock_idle_wakeup_event();
 }
 
-static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
+static void tick_nohz_start_idle(struct tick_sched *ts)
 {
-	ktime_t now = ktime_get();
-
-	ts->idle_entrytime = now;
+	ts->idle_entrytime = ktime_get();
 	ts->idle_active = 1;
 	sched_clock_idle_sleep_event();
-	return now;
 }
 
 /**
@@ -911,19 +908,21 @@ static bool can_stop_idle_tick(int cpu,
 	return true;
 }
 
-static void __tick_nohz_idle_enter(struct tick_sched *ts)
+static void __tick_nohz_idle_stop_tick(struct tick_sched *ts)
 {
-	ktime_t now, expires;
+	ktime_t expires;
 	int cpu = smp_processor_id();
 
-	now = tick_nohz_start_idle(ts);
-
 	if (can_stop_idle_tick(cpu, ts)) {
 		int was_stopped = ts->tick_stopped;
 
 		ts->idle_calls++;
 
-		expires = tick_nohz_stop_sched_tick(ts, now, cpu);
+		/*
+		 * The idle entry time should be a sufficient approximation of
+		 * the current time at this point.
+		 */
+		expires = tick_nohz_stop_sched_tick(ts, ts->idle_entrytime, cpu);
 		if (expires > 0LL) {
 			ts->idle_sleeps++;
 			ts->idle_expires = expires;
@@ -937,16 +936,19 @@ static void __tick_nohz_idle_enter(struc
 }
 
 /**
- * tick_nohz_idle_enter - stop the idle tick from the idle task
+ * tick_nohz_idle_stop_tick - stop the idle tick from the idle task
  *
  * When the next event is more than a tick into the future, stop the idle tick
- * Called when we start the idle loop.
- *
- * The arch is responsible of calling:
+ */
+void tick_nohz_idle_stop_tick(void)
+{
+	__tick_nohz_idle_stop_tick(this_cpu_ptr(&tick_cpu_sched));
+}
+
+/**
+ * tick_nohz_idle_enter - prepare for entering idle on the current CPU
  *
- * - rcu_idle_enter() after its last use of RCU before the CPU is put
- *  to sleep.
- * - rcu_idle_exit() before the first use of RCU after the CPU is woken up.
+ * Called when we start the idle loop.
  */
 void tick_nohz_idle_enter(void)
 {
@@ -965,7 +967,7 @@ void tick_nohz_idle_enter(void)
 
 	ts = this_cpu_ptr(&tick_cpu_sched);
 	ts->inidle = 1;
-	__tick_nohz_idle_enter(ts);
+	tick_nohz_start_idle(ts);
 
 	local_irq_enable();
 }
@@ -982,10 +984,12 @@ void tick_nohz_irq_exit(void)
 {
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
-	if (ts->inidle)
-		__tick_nohz_idle_enter(ts);
-	else
+	if (ts->inidle) {
+		tick_nohz_start_idle(ts);
+		__tick_nohz_idle_stop_tick(ts);
+	} else {
 		tick_nohz_full_update_tick(ts);
+	}
 }
 
 /**
Index: linux-pm/arch/x86/xen/smp_pv.c
===================================================================
--- linux-pm.orig/arch/x86/xen/smp_pv.c
+++ linux-pm/arch/x86/xen/smp_pv.c
@@ -425,6 +425,7 @@ static void xen_pv_play_dead(void) /* us
 	 * data back is to call:
 	 */
 	tick_nohz_idle_enter();
+	tick_nohz_idle_stop_tick_protected();
 
 	cpuhp_online_idle(CPUHP_AP_ONLINE_IDLE);
 }
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -221,6 +221,7 @@ static void do_idle(void)
 
 	__current_set_polling();
 	tick_nohz_idle_enter();
+	tick_nohz_idle_stop_tick_protected();
 
 	while (!need_resched()) {
 		check_pgt_cache();

* [RFT][PATCH v4 2/7] sched: idle: Do not stop the tick upfront in the idle loop
  2018-03-12  9:46 [RFT][PATCH v4 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
  2018-03-12  9:47 ` [RFT][PATCH v4 1/7] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
@ 2018-03-12  9:51 ` Rafael J. Wysocki
  2018-03-15 16:10   ` Frederic Weisbecker
  2018-03-12  9:53 ` [RFT][PATCH v4 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call() Rafael J. Wysocki
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2018-03-12  9:51 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM, Frederic Weisbecker
  Cc: Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Push the decision whether or not to stop the tick somewhat deeper
into the idle loop.

Stopping the tick upfront leads to unpleasant outcomes in case the
idle governor doesn't agree with the timekeeping code on the duration
of the upcoming idle period.  Specifically, if the tick has been
stopped and the idle governor predicts short idle, the situation is
bad regardless of whether or not the prediction is accurate.  If it
is accurate, the tick has been stopped unnecessarily which means
excessive overhead.  If it is not accurate, the CPU is likely to
spend too much time in the (shallow, because short idle has been
predicted) idle state selected by the governor [1].

As the first step towards addressing this problem, change the code
to make the tick stopping decision inside of the loop in do_idle().
In particular, do not stop the tick in the cpu_idle_poll() code path.
Also don't do that in tick_nohz_irq_exit(), which doesn't really have
enough information on whether or not to stop the tick.
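
Condensed, the idle loop then looks like this (see the diff below for
the real code):

	while (!need_resched()) {
		...
		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
			/* Polling: the tick is not stopped at all now. */
			cpu_idle_poll();
		} else {
			/* The tick stopping decision has moved here. */
			tick_nohz_idle_stop_tick();
			cpuidle_idle_call();
		}
		arch_cpu_idle_exit();
	}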

Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1]
Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 kernel/sched/idle.c      |    8 +++++---
 kernel/time/tick-sched.c |    6 ++----
 2 files changed, 7 insertions(+), 7 deletions(-)

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -221,13 +221,13 @@ static void do_idle(void)
 
 	__current_set_polling();
 	tick_nohz_idle_enter();
-	tick_nohz_idle_stop_tick_protected();
 
 	while (!need_resched()) {
 		check_pgt_cache();
 		rmb();
 
 		if (cpu_is_offline(cpu)) {
+			tick_nohz_idle_stop_tick_protected();
 			cpuhp_report_idle_dead();
 			arch_cpu_idle_dead();
 		}
@@ -241,10 +241,12 @@ static void do_idle(void)
 		 * broadcast device expired for us, we don't want to go deep
 		 * idle as we know that the IPI is going to arrive right away.
 		 */
-		if (cpu_idle_force_poll || tick_check_broadcast_expired())
+		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
 			cpu_idle_poll();
-		else
+		} else {
+			tick_nohz_idle_stop_tick();
 			cpuidle_idle_call();
+		}
 		arch_cpu_idle_exit();
 	}
 
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -984,12 +984,10 @@ void tick_nohz_irq_exit(void)
 {
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
-	if (ts->inidle) {
+	if (ts->inidle)
 		tick_nohz_start_idle(ts);
-		__tick_nohz_idle_stop_tick(ts);
-	} else {
+	else
 		tick_nohz_full_update_tick(ts);
-	}
 }
 
 /**

* [RFT][PATCH v4 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call()
  2018-03-12  9:46 [RFT][PATCH v4 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
  2018-03-12  9:47 ` [RFT][PATCH v4 1/7] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
  2018-03-12  9:51 ` [RFT][PATCH v4 2/7] sched: idle: Do not stop the tick upfront in the idle loop Rafael J. Wysocki
@ 2018-03-12  9:53 ` Rafael J. Wysocki
  2018-03-15 18:19   ` Frederic Weisbecker
  2018-03-12  9:54 ` [RFT][PATCH v4 4/7] cpuidle: Return nohz hint from cpuidle_select() Rafael J. Wysocki
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2018-03-12  9:53 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM, Frederic Weisbecker
  Cc: Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Make cpuidle_idle_call() decide whether or not to stop the tick.

First, the cpuidle_enter_s2idle() path deals with the tick (and with
the entire timekeeping for that matter) by itself and it doesn't need
the tick to be stopped beforehand.

Second, to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, it will be
necessary to change the ordering of cpuidle_select() with respect
to tick_nohz_idle_stop_tick().  To prepare for that, put a
tick_nohz_idle_stop_tick() call in the same branch in which
cpuidle_select() is called.
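
Condensed, cpuidle_idle_call() then has the following shape (s2idle
details omitted; see the diff below):

	if (cpuidle_not_available(drv, dev)) {
		tick_nohz_idle_stop_tick();
		rcu_idle_enter();
		default_idle_call();
	} else if (idle_should_enter_s2idle() || dev->use_deepest_state) {
		/* s2idle manages timekeeping by itself; otherwise stop
		 * the tick before finding the deepest state. */
		...
	} else {
		tick_nohz_idle_stop_tick();
		rcu_idle_enter();
		next_state = cpuidle_select(drv, dev);
		entered_state = call_cpuidle(drv, dev, next_state);
		...
	}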

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 kernel/sched/idle.c |   25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -146,13 +146,15 @@ static void cpuidle_idle_call(void)
 	}
 
 	/*
-	 * Tell the RCU framework we are entering an idle section,
-	 * so no more rcu read side critical sections and one more
+	 * The RCU framework needs to be told that we are entering an idle
+	 * section, so no more rcu read side critical sections and one more
 	 * step to the grace period
 	 */
-	rcu_idle_enter();
 
 	if (cpuidle_not_available(drv, dev)) {
+		tick_nohz_idle_stop_tick();
+		rcu_idle_enter();
+
 		default_idle_call();
 		goto exit_idle;
 	}
@@ -169,16 +171,26 @@ static void cpuidle_idle_call(void)
 
 	if (idle_should_enter_s2idle() || dev->use_deepest_state) {
 		if (idle_should_enter_s2idle()) {
+			rcu_idle_enter();
+
 			entered_state = cpuidle_enter_s2idle(drv, dev);
 			if (entered_state > 0) {
 				local_irq_enable();
 				goto exit_idle;
 			}
+
+			rcu_idle_exit();
 		}
 
+		tick_nohz_idle_stop_tick();
+		rcu_idle_enter();
+
 		next_state = cpuidle_find_deepest_state(drv, dev);
 		call_cpuidle(drv, dev, next_state);
 	} else {
+		tick_nohz_idle_stop_tick();
+		rcu_idle_enter();
+
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
 		 */
@@ -241,12 +253,11 @@ static void do_idle(void)
 		 * broadcast device expired for us, we don't want to go deep
 		 * idle as we know that the IPI is going to arrive right away.
 		 */
-		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+		if (cpu_idle_force_poll || tick_check_broadcast_expired())
 			cpu_idle_poll();
-		} else {
-			tick_nohz_idle_stop_tick();
+		else
 			cpuidle_idle_call();
-		}
+
 		arch_cpu_idle_exit();
 	}
 

* [RFT][PATCH v4 4/7] cpuidle: Return nohz hint from cpuidle_select()
  2018-03-12  9:46 [RFT][PATCH v4 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
                   ` (2 preceding siblings ...)
  2018-03-12  9:53 ` [RFT][PATCH v4 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call() Rafael J. Wysocki
@ 2018-03-12  9:54 ` Rafael J. Wysocki
  2018-03-14 12:59   ` Peter Zijlstra
  2018-03-12 10:04 ` [RFT][PATCH v4 5/7] sched: idle: Select idle state before stopping the tick Rafael J. Wysocki
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2018-03-12  9:54 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM, Frederic Weisbecker
  Cc: Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback through which a boolean value can be
returned to the caller, indicating whether or not the tick should be
stopped before entering the selected state.

Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return "false" through
it if:
 (1) the idle exit latency is constrained at 0,
 (2) the selected state is a polling one, or
 (3) the target residency of the selected state is less than the
     length of the tick period and there is at least one deeper
     idle state available within the tick period range.

Since the value returned through the new argument pointer is not
used yet, this change is not expected to alter the functionality of
the code.
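
In pseudocode, the menu governor's decision reduces to the sketch
below, where deeper_state_within_tick() is a hypothetical shorthand
for the first_idx lookup loop in the diff:

	if (latency_req == 0) {
		*nohz_ret = false;	/* condition (1) */
		return 0;
	}
	...
	if (drv->states[idx].flags & CPUIDLE_FLAG_POLLING)
		*nohz_ret = false;	/* condition (2) */
	else if (drv->states[idx].target_residency < TICK_USEC_HZ)
		/* condition (3): keep the tick if a deeper state could
		 * still be used within the tick period */
		*nohz_ret = !deeper_state_within_tick(drv, dev, idx);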

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/cpuidle.c          |   10 ++++++++--
 drivers/cpuidle/governors/ladder.c |    3 ++-
 drivers/cpuidle/governors/menu.c   |   32 +++++++++++++++++++++++++++++---
 include/linux/cpuidle.h            |    8 +++++---
 kernel/sched/idle.c                |    4 +++-
 5 files changed, 47 insertions(+), 10 deletions(-)

Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -131,7 +131,8 @@ extern bool cpuidle_not_available(struct
 				  struct cpuidle_device *dev);
 
 extern int cpuidle_select(struct cpuidle_driver *drv,
-			  struct cpuidle_device *dev);
+			  struct cpuidle_device *dev,
+			  bool *nohz_ret);
 extern int cpuidle_enter(struct cpuidle_driver *drv,
 			 struct cpuidle_device *dev, int index);
 extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -163,7 +164,7 @@ static inline bool cpuidle_not_available
 					 struct cpuidle_device *dev)
 {return true; }
 static inline int cpuidle_select(struct cpuidle_driver *drv,
-				 struct cpuidle_device *dev)
+				 struct cpuidle_device *dev, bool *nohz_ret)
 {return -ENODEV; }
 static inline int cpuidle_enter(struct cpuidle_driver *drv,
 				struct cpuidle_device *dev, int index)
@@ -246,7 +247,8 @@ struct cpuidle_governor {
 					struct cpuidle_device *dev);
 
 	int  (*select)		(struct cpuidle_driver *drv,
-					struct cpuidle_device *dev);
+					struct cpuidle_device *dev,
+					bool *nohz_ret);
 	void (*reflect)		(struct cpuidle_device *dev, int index);
 };
 
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -188,13 +188,15 @@ static void cpuidle_idle_call(void)
 		next_state = cpuidle_find_deepest_state(drv, dev);
 		call_cpuidle(drv, dev, next_state);
 	} else {
+		bool nohz = true;
+
 		tick_nohz_idle_stop_tick();
 		rcu_idle_enter();
 
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
 		 */
-		next_state = cpuidle_select(drv, dev);
+		next_state = cpuidle_select(drv, dev, &nohz);
 		entered_state = call_cpuidle(drv, dev, next_state);
 		/*
 		 * Give the governor an opportunity to reflect on the outcome
Index: linux-pm/drivers/cpuidle/cpuidle.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/cpuidle.c
+++ linux-pm/drivers/cpuidle/cpuidle.c
@@ -263,12 +263,18 @@ int cpuidle_enter_state(struct cpuidle_d
  *
  * @drv: the cpuidle driver
  * @dev: the cpuidle device
+ * @nohz_ret: indication on whether or not to stop the tick
  *
  * Returns the index of the idle state.  The return value must not be negative.
+ *
+ * The memory location pointed to by @nohz_ret is expected to be written the
+ * 'false' boolean value if the scheduler tick should not be stopped before
+ * entering the returned state.
  */
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+		   bool *nohz_ret)
 {
-	return cpuidle_curr_governor->select(drv, dev);
+	return cpuidle_curr_governor->select(drv, dev, nohz_ret);
 }
 
 /**
Index: linux-pm/drivers/cpuidle/governors/ladder.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/ladder.c
+++ linux-pm/drivers/cpuidle/governors/ladder.c
@@ -63,9 +63,10 @@ static inline void ladder_do_selection(s
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
+ * @dummy: not used
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-				struct cpuidle_device *dev)
+			       struct cpuidle_device *dev, bool *dummy)
 {
 	struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
 	struct device *device = get_cpu_device(dev->cpu);
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -275,12 +275,16 @@ again:
 	goto again;
 }
 
+#define TICK_USEC_HZ   ((USEC_PER_SEC + HZ/2) / HZ)
+
 /**
  * menu_select - selects the next idle state to enter
  * @drv: cpuidle driver containing state data
  * @dev: the CPU
+ * @nohz_ret: indication on whether or not to stop the tick
  */
-static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+		       bool *nohz_ret)
 {
 	struct menu_device *data = this_cpu_ptr(&menu_devices);
 	struct device *device = get_cpu_device(dev->cpu);
@@ -303,8 +307,10 @@ static int menu_select(struct cpuidle_dr
 		latency_req = resume_latency;
 
 	/* Special case when user has set very strict latency requirement */
-	if (unlikely(latency_req == 0))
+	if (unlikely(latency_req == 0)) {
+		*nohz_ret = false;
 		return 0;
+	}
 
 	/* determine the expected residency time, round up */
 	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
@@ -369,7 +375,7 @@ static int menu_select(struct cpuidle_dr
 			idx = i; /* first enabled state */
 		if (s->target_residency > data->predicted_us)
 			break;
-		if (s->exit_latency > latency_req)
+	        if (s->exit_latency > latency_req)
 			break;
 
 		idx = i;
@@ -378,6 +384,26 @@ static int menu_select(struct cpuidle_dr
 	if (idx == -1)
 		idx = 0; /* No states enabled. Must use 0. */
 
+	if (drv->states[idx].flags & CPUIDLE_FLAG_POLLING) {
+		*nohz_ret = false;
+	} else if (drv->states[idx].target_residency < TICK_USEC_HZ) {
+		first_idx = idx;
+		for (i = idx + 1; i < drv->state_count; i++)
+			if (!drv->states[i].disabled &&
+			    !dev->states_usage[i].disable) {
+				first_idx = i;
+				break;
+			}
+
+		/*
+		 * Do not stop the tick if there is at least one more state
+		 * within the tick period range that could be used if longer
+		 * idle duration was predicted.
+		 */
+		*nohz_ret = !(first_idx > idx &&
+			      drv->states[first_idx].target_residency < TICK_USEC_HZ);
+	}
+
 	data->last_state_idx = idx;
 
 	return data->last_state_idx;

* [RFT][PATCH v4 5/7] sched: idle: Select idle state before stopping the tick
  2018-03-12  9:46 [RFT][PATCH v4 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
                   ` (3 preceding siblings ...)
  2018-03-12  9:54 ` [RFT][PATCH v4 4/7] cpuidle: Return nohz hint from cpuidle_select() Rafael J. Wysocki
@ 2018-03-12 10:04 ` Rafael J. Wysocki
  2018-03-12 10:05 ` [RFT][PATCH v4 6/7] cpuidle: menu: Refine idle state selection for running tick Rafael J. Wysocki
  2018-03-12 10:07 ` [RFT][PATCH v4 7/7] cpuidle: menu: Avoid selecting shallow states with stopped tick Rafael J. Wysocki
  6 siblings, 0 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2018-03-12 10:04 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM, Frederic Weisbecker
  Cc: Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

In order to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, reorder the
code in cpuidle_idle_call() so that the governor idle state selection
runs before tick_nohz_idle_stop_tick() and use the "nohz" hint returned
by cpuidle_select() to decide whether or not to stop the tick.

This isn't straightforward, because menu_select() invokes
tick_nohz_get_sleep_length() to get the time to the next timer
event and the number returned by the latter comes from
__tick_nohz_idle_stop_tick().  Fortunately, however, it is possible
to compute that number without actually stopping the tick and with
the help of the existing code.

Namely, notice that tick_nohz_stop_sched_tick() already computes the
next timer event time to reprogram the scheduler tick hrtimer and
that time can be used as a proxy for the actual next timer event
time in the idle duration prediction.  Moreover, it is possible
to split tick_nohz_stop_sched_tick() into two separate routines,
one computing the time to the next timer event and the other
simply stopping the tick when the time to the next timer event
is known.

Accordingly, split tick_nohz_stop_sched_tick() into
tick_nohz_next_event() and tick_nohz_stop_tick() and use the
former in tick_nohz_get_sleep_length().  Add two new fields,
timer_expires and timer_expires_base, to struct tick_sched for
passing data between these two new functions and to indicate that
tick_nohz_next_event() has run and tick_nohz_stop_tick() can be
called now.  Also drop the now redundant sleep_length field from
there.
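
The resulting calling convention, condensed (this two-step sequence
is exactly what the new NO_HZ_FULL helper in the diff does):

	/* Step 1: compute the next event time without stopping the
	 * tick; tick_nohz_get_sleep_length() can now use this alone. */
	expires = tick_nohz_next_event(ts, cpu);

	/* Step 2: stop the tick only if that is worth it. */
	if (expires)
		tick_nohz_stop_tick(ts, cpu);
	else
		tick_nohz_retain_tick(ts);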

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v3 -> v4:
 * Drop the somewhat redundant tick_may_stop flag.
 * Rename timer_expires_basemono to timer_expires_base and use it as the
   cached state validity indicator too instead of tick_may_stop.
 * Document the new fields in struct tick_sched.
 * Drop double underscores from the names of two new functions.
 * Add WARN_ON_ONCE() to tick_nohz_idle_enter/exit() to catch leaks of the
   cached state.
 * Add WARN_ON_ONCE() to tick_nohz_get_sleep_length() to catch invocations
   of it outside of idle.

---
 include/linux/tick.h     |    2 
 kernel/sched/idle.c      |   11 ++-
 kernel/time/tick-sched.c |  156 +++++++++++++++++++++++++++++++----------------
 kernel/time/tick-sched.h |    6 +
 4 files changed, 120 insertions(+), 55 deletions(-)

Index: linux-pm/kernel/time/tick-sched.h
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.h
+++ linux-pm/kernel/time/tick-sched.h
@@ -38,7 +38,8 @@ enum tick_nohz_mode {
  * @idle_exittime:	Time when the idle state was left
  * @idle_sleeptime:	Sum of the time slept in idle with sched tick stopped
  * @iowait_sleeptime:	Sum of the time slept in idle with sched tick stopped, with IO outstanding
- * @sleep_length:	Duration of the current idle sleep
+ * @timer_expires:	Anticipated timer expiration time (in case sched tick is stopped)
+ * @timer_expires_base:	Base time clock monotonic for @timer_expires
  * @do_timer_lst:	CPU was the last one doing do_timer before going idle
  */
 struct tick_sched {
@@ -58,8 +59,9 @@ struct tick_sched {
 	ktime_t				idle_exittime;
 	ktime_t				idle_sleeptime;
 	ktime_t				iowait_sleeptime;
-	ktime_t				sleep_length;
 	unsigned long			last_jiffies;
+	u64				timer_expires;
+	u64				timer_expires_base;
 	u64				next_timer;
 	ktime_t				idle_expires;
 	int				do_timer_last;
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -190,13 +190,18 @@ static void cpuidle_idle_call(void)
 	} else {
 		bool nohz = true;
 
-		tick_nohz_idle_stop_tick();
-		rcu_idle_enter();
-
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
 		 */
 		next_state = cpuidle_select(drv, dev, &nohz);
+
+		if (nohz)
+			tick_nohz_idle_stop_tick();
+		else
+			tick_nohz_idle_retain_tick();
+
+		rcu_idle_enter();
+
 		entered_state = call_cpuidle(drv, dev, next_state);
 		/*
 		 * Give the governor an opportunity to reflect on the outcome
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -652,13 +652,10 @@ static inline bool local_timer_softirq_p
 	return local_softirq_pending() & TIMER_SOFTIRQ;
 }
 
-static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
-					 ktime_t now, int cpu)
+static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
 {
-	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
 	u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
 	unsigned long seq, basejiff;
-	ktime_t	tick;
 
 	/* Read jiffies and the time when jiffies were updated last */
 	do {
@@ -667,6 +664,7 @@ static ktime_t tick_nohz_stop_sched_tick
 		basejiff = jiffies;
 	} while (read_seqretry(&jiffies_lock, seq));
 	ts->last_jiffies = basejiff;
+	ts->timer_expires_base = basemono;
 
 	/*
 	 * Keep the periodic tick, when RCU, architecture or irq_work
@@ -711,31 +709,24 @@ static ktime_t tick_nohz_stop_sched_tick
 		 * next period, so no point in stopping it either, bail.
 		 */
 		if (!ts->tick_stopped) {
-			tick = 0;
+			ts->timer_expires = 0;
 			goto out;
 		}
 	}
 
 	/*
-	 * If this CPU is the one which updates jiffies, then give up
-	 * the assignment and let it be taken by the CPU which runs
-	 * the tick timer next, which might be this CPU as well. If we
-	 * don't drop this here the jiffies might be stale and
-	 * do_timer() never invoked. Keep track of the fact that it
-	 * was the one which had the do_timer() duty last. If this CPU
-	 * is the one which had the do_timer() duty last, we limit the
-	 * sleep time to the timekeeping max_deferment value.
+	 * If this CPU is the one which had the do_timer() duty last, we limit
+	 * the sleep time to the timekeeping max_deferment value.
 	 * Otherwise we can sleep as long as we want.
 	 */
 	delta = timekeeping_max_deferment();
-	if (cpu == tick_do_timer_cpu) {
-		tick_do_timer_cpu = TICK_DO_TIMER_NONE;
-		ts->do_timer_last = 1;
-	} else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
-		delta = KTIME_MAX;
-		ts->do_timer_last = 0;
-	} else if (!ts->do_timer_last) {
-		delta = KTIME_MAX;
+	if (cpu != tick_do_timer_cpu) {
+		if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
+			delta = KTIME_MAX;
+			ts->do_timer_last = 0;
+		} else if (!ts->do_timer_last) {
+			delta = KTIME_MAX;
+		}
 	}
 
 #ifdef CONFIG_NO_HZ_FULL
@@ -750,14 +741,40 @@ static ktime_t tick_nohz_stop_sched_tick
 	else
 		expires = KTIME_MAX;
 
-	expires = min_t(u64, expires, next_tick);
-	tick = expires;
+	ts->timer_expires = min_t(u64, expires, next_tick);
+
+out:
+	return ts->timer_expires;
+}
+
+static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
+{
+	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
+	u64 basemono = ts->timer_expires_base;
+	u64 expires = ts->timer_expires;
+	ktime_t tick = expires;
+
+	/* Make sure we won't be trying to stop it twice in a row. */
+	ts->timer_expires_base = 0;
+
+	/*
+	 * If this CPU is the one which updates jiffies, then give up
+	 * the assignment and let it be taken by the CPU which runs
+	 * the tick timer next, which might be this CPU as well. If we
+	 * don't drop this here the jiffies might be stale and
+	 * do_timer() never invoked. Keep track of the fact that it
+	 * was the one which had the do_timer() duty last.
+	 */
+	if (cpu == tick_do_timer_cpu) {
+		tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+		ts->do_timer_last = 1;
+	}
 
 	/* Skip reprogram of event if its not changed */
 	if (ts->tick_stopped && (expires == ts->next_tick)) {
 		/* Sanity check: make sure clockevent is actually programmed */
 		if (tick == KTIME_MAX || ts->next_tick == hrtimer_get_expires(&ts->sched_timer))
-			goto out;
+			return;
 
 		WARN_ON_ONCE(1);
 		printk_once("basemono: %llu ts->next_tick: %llu dev->next_event: %llu timer->active: %d timer->expires: %llu\n",
@@ -791,7 +808,7 @@ static ktime_t tick_nohz_stop_sched_tick
 	if (unlikely(expires == KTIME_MAX)) {
 		if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
 			hrtimer_cancel(&ts->sched_timer);
-		goto out;
+		return;
 	}
 
 	hrtimer_set_expires(&ts->sched_timer, tick);
@@ -800,15 +817,23 @@ static ktime_t tick_nohz_stop_sched_tick
 		hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED);
 	else
 		tick_program_event(tick, 1);
-out:
-	/*
-	 * Update the estimated sleep length until the next timer
-	 * (not only the tick).
-	 */
-	ts->sleep_length = ktime_sub(dev->next_event, now);
-	return tick;
 }
 
+static void tick_nohz_retain_tick(struct tick_sched *ts)
+{
+	ts->timer_expires_base = 0;
+}
+
+#ifdef CONFIG_NO_HZ_FULL
+static void tick_nohz_stop_sched_tick(struct tick_sched *ts, int cpu)
+{
+	if (tick_nohz_next_event(ts, cpu))
+		tick_nohz_stop_tick(ts, cpu);
+	else
+		tick_nohz_retain_tick(ts);
+}
+#endif /* CONFIG_NO_HZ_FULL */
+
 static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
 	/* Update jiffies first */
@@ -844,7 +869,7 @@ static void tick_nohz_full_update_tick(s
 		return;
 
 	if (can_stop_full_tick(cpu, ts))
-		tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+		tick_nohz_stop_sched_tick(ts, cpu);
 	else if (ts->tick_stopped)
 		tick_nohz_restart_sched_tick(ts, ktime_get());
 #endif
@@ -870,10 +895,8 @@ static bool can_stop_idle_tick(int cpu,
 		return false;
 	}
 
-	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) {
-		ts->sleep_length = NSEC_PER_SEC / HZ;
+	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
 		return false;
-	}
 
 	if (need_resched())
 		return false;
@@ -913,25 +936,33 @@ static void __tick_nohz_idle_stop_tick(s
 	ktime_t expires;
 	int cpu = smp_processor_id();
 
-	if (can_stop_idle_tick(cpu, ts)) {
+	/*
+	 * If tick_nohz_get_sleep_length() ran tick_nohz_next_event(), the
+	 * tick timer expiration time is known already.
+	 */
+	if (ts->timer_expires_base)
+		expires = ts->timer_expires;
+	else if (can_stop_idle_tick(cpu, ts))
+		expires = tick_nohz_next_event(ts, cpu);
+	else
+		return;
+
+	ts->idle_calls++;
+
+	if (expires > 0LL) {
 		int was_stopped = ts->tick_stopped;
 
-		ts->idle_calls++;
+		tick_nohz_stop_tick(ts, cpu);
 
-		/*
-		 * The idle entry time should be a sufficient approximation of
-		 * the current time at this point.
-		 */
-		expires = tick_nohz_stop_sched_tick(ts, ts->idle_entrytime, cpu);
-		if (expires > 0LL) {
-			ts->idle_sleeps++;
-			ts->idle_expires = expires;
-		}
+		ts->idle_sleeps++;
+		ts->idle_expires = expires;
 
 		if (!was_stopped && ts->tick_stopped) {
 			ts->idle_jiffies = ts->last_jiffies;
 			nohz_balance_enter_idle(cpu);
 		}
+	} else {
+		tick_nohz_retain_tick(ts);
 	}
 }
 
@@ -945,6 +976,11 @@ void tick_nohz_idle_stop_tick(void)
 	__tick_nohz_idle_stop_tick(this_cpu_ptr(&tick_cpu_sched));
 }
 
+void tick_nohz_idle_retain_tick(void)
+{
+	tick_nohz_retain_tick(this_cpu_ptr(&tick_cpu_sched));
+}
+
 /**
  * tick_nohz_idle_enter - prepare for entering idle on the current CPU
  *
@@ -957,7 +993,7 @@ void tick_nohz_idle_enter(void)
 	lockdep_assert_irqs_enabled();
 	/*
 	 * Update the idle state in the scheduler domain hierarchy
-	 * when tick_nohz_stop_sched_tick() is called from the idle loop.
+	 * when tick_nohz_stop_tick() is called from the idle loop.
 	 * State will be updated to busy during the first busy tick after
 	 * exiting idle.
 	 */
@@ -966,6 +1002,9 @@ void tick_nohz_idle_enter(void)
 	local_irq_disable();
 
 	ts = this_cpu_ptr(&tick_cpu_sched);
+
+	WARN_ON_ONCE(ts->timer_expires_base);
+
 	ts->inidle = 1;
 	tick_nohz_start_idle(ts);
 
@@ -991,15 +1030,31 @@ void tick_nohz_irq_exit(void)
 }
 
 /**
- * tick_nohz_get_sleep_length - return the length of the current sleep
+ * tick_nohz_get_sleep_length - return the expected length of the current sleep
  *
  * Called from power state control code with interrupts disabled
  */
 ktime_t tick_nohz_get_sleep_length(void)
 {
+	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+	int cpu = smp_processor_id();
+	/*
+	 * The idle entry time is expected to be a sufficient approximation of
+	 * the current time at this point.
+	 */
+	ktime_t now = ts->idle_entrytime;
+
+	WARN_ON_ONCE(!ts->inidle);
+
+	if (can_stop_idle_tick(cpu, ts)) {
+		ktime_t next_event = tick_nohz_next_event(ts, cpu);
+
+		if (next_event)
+			return ktime_sub(next_event, now);
+	}
 
-	return ts->sleep_length;
+	return ktime_sub(dev->next_event, now);
 }
 
 /**
@@ -1063,6 +1118,7 @@ void tick_nohz_idle_exit(void)
 	local_irq_disable();
 
 	WARN_ON_ONCE(!ts->inidle);
+	WARN_ON_ONCE(ts->timer_expires_base);
 
 	ts->inidle = 0;
 
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -115,6 +115,7 @@ enum tick_dep_bits {
 extern bool tick_nohz_enabled;
 extern int tick_nohz_tick_stopped(void);
 extern void tick_nohz_idle_stop_tick(void);
+extern void tick_nohz_idle_retain_tick(void);
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
@@ -127,6 +128,7 @@ extern u64 get_cpu_iowait_time_us(int cp
 #define tick_nohz_enabled (0)
 static inline int tick_nohz_tick_stopped(void) { return 0; }
 static inline void tick_nohz_idle_stop_tick(void) { }
+static inline void tick_nohz_idle_retain_tick(void) { }
 static inline void tick_nohz_idle_enter(void) { }
 static inline void tick_nohz_idle_exit(void) { }
 

* [RFT][PATCH v4 6/7] cpuidle: menu: Refine idle state selection for running tick
  2018-03-12  9:46 [RFT][PATCH v4 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
                   ` (4 preceding siblings ...)
  2018-03-12 10:04 ` [RFT][PATCH v4 5/7] sched: idle: Select idle state before stopping the tick Rafael J. Wysocki
@ 2018-03-12 10:05 ` Rafael J. Wysocki
  2018-03-12 10:07 ` [RFT][PATCH v4 7/7] cpuidle: menu: Avoid selecting shallow states with stopped tick Rafael J. Wysocki
  6 siblings, 0 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2018-03-12 10:05 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM, Frederic Weisbecker
  Cc: Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

If the tick isn't stopped, the target residency of the state selected
by the menu governor may be greater than the actual time to the next
tick, and that means wasted energy.

To avoid that, make tick_nohz_get_sleep_length() return the current
time to the next event (i.e. computed before stopping the tick) in
addition to the estimated one, via an extra pointer argument, and
make menu_select() use that value to refine the state selection
when necessary.
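
Condensed, the refinement amounts to the following (based on the
menu.c hunk below):

	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length(&tick_time));
	...
	/* The tick will not be stopped, so avoid states that cannot
	 * pay off before the next tick. */
	tick_us = ktime_to_us(tick_time);
	while (idx > 0 && (drv->states[idx].target_residency > tick_us ||
			   drv->states[idx].disabled ||
			   dev->states_usage[idx].disable))
		idx--;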

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/menu.c |   21 ++++++++++++++++++---
 include/linux/tick.h             |    2 +-
 kernel/time/tick-sched.c         |    7 +++++--
 3 files changed, 24 insertions(+), 6 deletions(-)

Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -119,7 +119,7 @@ extern void tick_nohz_idle_retain_tick(v
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
-extern ktime_t tick_nohz_get_sleep_length(void);
+extern ktime_t tick_nohz_get_sleep_length(ktime_t *cur_ret);
 extern unsigned long tick_nohz_get_idle_calls(void);
 extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -1031,10 +1031,11 @@ void tick_nohz_irq_exit(void)
 
 /**
  * tick_nohz_get_sleep_length - return the expected length of the current sleep
+ * @cur_ret: pointer for returning the current time to the next event
  *
  * Called from power state control code with interrupts disabled
  */
-ktime_t tick_nohz_get_sleep_length(void)
+ktime_t tick_nohz_get_sleep_length(ktime_t *cur_ret)
 {
 	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
@@ -1047,6 +1048,8 @@ ktime_t tick_nohz_get_sleep_length(void)
 
 	WARN_ON_ONCE(!ts->inidle);
 
+	*cur_ret = ktime_sub(dev->next_event, now);
+
 	if (can_stop_idle_tick(cpu, ts)) {
 		ktime_t next_event = tick_nohz_next_event(ts, cpu);
 
@@ -1054,7 +1057,7 @@ ktime_t tick_nohz_get_sleep_length(void)
 			return ktime_sub(next_event, now);
 	}
 
-	return ktime_sub(dev->next_event, now);
+	return *cur_ret;
 }
 
 /**
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -296,6 +296,7 @@ static int menu_select(struct cpuidle_dr
 	unsigned int expected_interval;
 	unsigned long nr_iowaiters, cpu_load;
 	int resume_latency = dev_pm_qos_raw_read_value(device);
+	ktime_t tick_time;
 
 	if (data->needs_update) {
 		menu_update(drv, dev);
@@ -313,7 +314,7 @@ static int menu_select(struct cpuidle_dr
 	}
 
 	/* determine the expected residency time, round up */
-	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
+	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length(&tick_time));
 
 	get_iowait_load(&nr_iowaiters, &cpu_load);
 	data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
@@ -400,8 +401,22 @@ static int menu_select(struct cpuidle_dr
 		 * within the tick period range that could be used if longer
 		 * idle duration was predicted.
 		 */
-		*nohz_ret = !(first_idx > idx &&
-			      drv->states[first_idx].target_residency < TICK_USEC_HZ);
+		if (first_idx > idx &&
+		    drv->states[first_idx].target_residency < TICK_USEC_HZ) {
+			unsigned int tick_us = ktime_to_us(tick_time);
+
+			/*
+			 * Find a state with target residency less than the
+			 * time to the next timer event including the tick.
+			 */
+			while (idx > 0 &&
+			    (drv->states[idx].target_residency > tick_us ||
+			     drv->states[idx].disabled ||
+			     dev->states_usage[idx].disable))
+				idx--;
+
+			*nohz_ret = false;
+		}
 	}
 
 	data->last_state_idx = idx;

* [RFT][PATCH v4 7/7] cpuidle: menu: Avoid selecting shallow states with stopped tick
  2018-03-12  9:46 [RFT][PATCH v4 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
                   ` (5 preceding siblings ...)
  2018-03-12 10:05 ` [RFT][PATCH v4 6/7] cpuidle: menu: Refine idle state selection for running tick Rafael J. Wysocki
@ 2018-03-12 10:07 ` Rafael J. Wysocki
  6 siblings, 0 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2018-03-12 10:07 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM, Frederic Weisbecker
  Cc: Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

If the scheduler tick has been stopped already and the governor
selects a shallow idle state, the CPU may spend a long time in that
state if the selection is based on an inaccurate prediction of idle
period duration.  That effect turns out to occur relatively often,
so it needs to be mitigated.

To that end, modify the menu governor to discard the result of the
idle period duration prediction if it is less than the tick period
duration and the tick is stopped, unless the tick timer is going to
expire soon.
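
The core of the change, condensed, with example numbers in the
comment (the example assumes HZ=250, i.e. TICK_USEC_HZ == 4000):

	/*
	 * Example: tick already stopped, predicted_us == 500 and the
	 * next timer event is tick_us == 2000 away.  The 500 us
	 * prediction is discarded and predicted_us becomes
	 * min(4000, 2000) == 2000, steering the selection towards a
	 * deeper state, unless the timer fires really soon anyway.
	 */
	if (tick_nohz_tick_stopped() && data->predicted_us < TICK_USEC_HZ)
		data->predicted_us = min_t(unsigned int, TICK_USEC_HZ, tick_us);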

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/menu.c |   26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -297,6 +297,7 @@ static int menu_select(struct cpuidle_dr
 	unsigned long nr_iowaiters, cpu_load;
 	int resume_latency = dev_pm_qos_raw_read_value(device);
 	ktime_t tick_time;
+	unsigned int tick_us;
 
 	if (data->needs_update) {
 		menu_update(drv, dev);
@@ -315,6 +316,7 @@ static int menu_select(struct cpuidle_dr
 
 	/* determine the expected residency time, round up */
 	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length(&tick_time));
+	tick_us = ktime_to_us(tick_time);
 
 	get_iowait_load(&nr_iowaiters, &cpu_load);
 	data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
@@ -354,12 +356,24 @@ static int menu_select(struct cpuidle_dr
 	data->predicted_us = min(data->predicted_us, expected_interval);
 
 	/*
-	 * Use the performance multiplier and the user-configurable
-	 * latency_req to determine the maximum exit latency.
+	 * If the tick is already stopped, the cost of possible misprediction is
+	 * much higher, because the CPU may be stuck in a shallow idle state for
+	 * a long time as a result of it.  For this reason, if that happens say
+	 * we might mispredict and try to force the CPU into a state for which
+	 * we would have stopped the tick, unless the tick timer is going to
+	 * expire really soon anyway.
 	 */
-	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
-	if (latency_req > interactivity_req)
-		latency_req = interactivity_req;
+	if (tick_nohz_tick_stopped() && data->predicted_us < TICK_USEC_HZ) {
+		data->predicted_us = min_t(unsigned int, TICK_USEC_HZ, tick_us);
+	} else {
+		/*
+		 * Use the performance multiplier and the user-configurable
+		 * latency_req to determine the maximum exit latency.
+		 */
+		interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
+		if (latency_req > interactivity_req)
+			latency_req = interactivity_req;
+	}
 
 	/*
 	 * Find the idle state with the lowest power while satisfying
@@ -403,8 +417,6 @@ static int menu_select(struct cpuidle_dr
 		 */
 		if (first_idx > idx &&
 		    drv->states[first_idx].target_residency < TICK_USEC_HZ) {
-			unsigned int tick_us = ktime_to_us(tick_time);
-
 			/*
 			 * Find a state with target residency less than the
 			 * time to the next timer event including the tick.

* Re: [RFT][PATCH v4 4/7] cpuidle: Return nohz hint from cpuidle_select()
  2018-03-12  9:54 ` [RFT][PATCH v4 4/7] cpuidle: Return nohz hint from cpuidle_select() Rafael J. Wysocki
@ 2018-03-14 12:59   ` Peter Zijlstra
  2018-03-15 12:54     ` Rafael J. Wysocki
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2018-03-14 12:59 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, Frederic Weisbecker, Thomas Gleixner, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML

On Mon, Mar 12, 2018 at 10:54:18AM +0100, Rafael J. Wysocki wrote:
> @@ -378,6 +384,26 @@ static int menu_select(struct cpuidle_dr
>  	if (idx == -1)
>  		idx = 0; /* No states enabled. Must use 0. */
>  
> +	if (drv->states[idx].flags & CPUIDLE_FLAG_POLLING) {
> +		*nohz_ret = false;
> +	} else if (drv->states[idx].target_residency < TICK_USEC_HZ) {
> +		first_idx = idx;
> +		for (i = idx + 1; i < drv->state_count; i++) {
> +			if (!drv->states[i].disabled &&
> +			    !dev->states_usage[i].disable) {
> +				first_idx = i;
> +				break;
> +			}
		}
> +
> +		/*
> +		 * Do not stop the tick if there is at least one more state
> +		 * within the tick period range that could be used if longer
> +		 * idle duration was predicted.
> +		 */
> +		*nohz_ret = !(first_idx > idx &&
> +			      drv->states[first_idx].target_residency < TICK_USEC_HZ);


Why though? That comment only states what it does, but gives no clue as
to why we're doing this. What was wrong with disabling NOHZ if the
selected state (@idx) has a shorter target residency?


> +	}
> +
>  	data->last_state_idx = idx;
>  
>  	return data->last_state_idx;
> 

* Re: [RFT][PATCH v4 1/7] time: tick-sched: Reorganize idle tick management code
  2018-03-12  9:47 ` [RFT][PATCH v4 1/7] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
@ 2018-03-14 15:49   ` Frederic Weisbecker
  2018-03-14 17:20     ` Peter Zijlstra
  2018-03-15 12:33     ` Rafael J. Wysocki
  0 siblings, 2 replies; 21+ messages in thread
From: Frederic Weisbecker @ 2018-03-14 15:49 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Linux PM, Frederic Weisbecker, Thomas Gleixner,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML

On Mon, Mar 12, 2018 at 10:47:41AM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Prepare the scheduler tick code for reworking the idle loop to
> avoid stopping the tick in some cases.
> 
> Move away the tick_nohz_start_idle() invocation from
> __tick_nohz_idle_enter(), rename the latter to
> __tick_nohz_idle_stop_tick() and define tick_nohz_idle_stop_tick()
> as a wrapper around it for calling it from the outside.
> 
> Make tick_nohz_idle_enter() only call tick_nohz_start_idle() instead
> of calling the entire __tick_nohz_idle_enter(), add another wrapper
> disabling and enabling interrupts around tick_nohz_idle_stop_tick()
> and make the current callers of tick_nohz_idle_enter() call it too
> to retain their current functionality.

Perhaps we should have a higher-level description of what the patch does.
After all, the low-level part is already described in the diff.

I.e., we are splitting the nohz idle entry call to decouple the idle time
stats accounting and preparatory work from the actual tick stop code, in
order to later be able to delay the tick stop once we reach more
power-knowledgeable callers.

> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  arch/x86/xen/smp_pv.c    |    1 +
>  include/linux/tick.h     |    9 +++++++++
>  kernel/sched/idle.c      |    1 +
>  kernel/time/tick-sched.c |   46 +++++++++++++++++++++++++---------------------
>  4 files changed, 36 insertions(+), 21 deletions(-)
> 
> Index: linux-pm/include/linux/tick.h
> ===================================================================
> --- linux-pm.orig/include/linux/tick.h
> +++ linux-pm/include/linux/tick.h
> @@ -114,6 +114,7 @@ enum tick_dep_bits {
>  #ifdef CONFIG_NO_HZ_COMMON
>  extern bool tick_nohz_enabled;
>  extern int tick_nohz_tick_stopped(void);
> +extern void tick_nohz_idle_stop_tick(void);
>  extern void tick_nohz_idle_enter(void);
>  extern void tick_nohz_idle_exit(void);
>  extern void tick_nohz_irq_exit(void);
> @@ -125,6 +126,7 @@ extern u64 get_cpu_iowait_time_us(int cp
>  #else /* !CONFIG_NO_HZ_COMMON */
>  #define tick_nohz_enabled (0)
>  static inline int tick_nohz_tick_stopped(void) { return 0; }
> +static inline void tick_nohz_idle_stop_tick(void) { }
>  static inline void tick_nohz_idle_enter(void) { }
>  static inline void tick_nohz_idle_exit(void) { }
>  
> @@ -136,6 +138,13 @@ static inline u64 get_cpu_idle_time_us(i
>  static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
>  #endif /* !CONFIG_NO_HZ_COMMON */
>  
> +static inline void tick_nohz_idle_stop_tick_protected(void)
> +{
> +	local_irq_disable();
> +	tick_nohz_idle_stop_tick();
> +	local_irq_enable();
> +}

It seems that even if we have CONFIG_NO_HZ_COMMON=n, tick_nohz_idle_stop_tick_protected()
will have overhead, right?

Thanks.

* Re: [RFT][PATCH v4 1/7] time: tick-sched: Reorganize idle tick management code
  2018-03-14 15:49   ` Frederic Weisbecker
@ 2018-03-14 17:20     ` Peter Zijlstra
  2018-03-15 17:26       ` Frederic Weisbecker
  2018-03-15 12:33     ` Rafael J. Wysocki
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2018-03-14 17:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Rafael J. Wysocki, Linux PM, Frederic Weisbecker,
	Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

On Wed, Mar 14, 2018 at 04:49:39PM +0100, Frederic Weisbecker wrote:
> On Mon, Mar 12, 2018 at 10:47:41AM +0100, Rafael J. Wysocki wrote:

> > @@ -136,6 +138,13 @@ static inline u64 get_cpu_idle_time_us(i
> >  static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
> >  #endif /* !CONFIG_NO_HZ_COMMON */
> >  
> > +static inline void tick_nohz_idle_stop_tick_protected(void)
> > +{
> > +	local_irq_disable();
> > +	tick_nohz_idle_stop_tick();
> > +	local_irq_enable();
> > +}
> 
> It seems that even if we have CONFIG_NO_HZ_COMMON=n,
> tick_nohz_idle_stop_tick_protected() will have overhead, right?

IIRC the only callsite of the _protected thing that remains at the end
is in the hotplug path. So who cares ;-)

* Re: [RFT][PATCH v4 1/7] time: tick-sched: Reorganize idle tick management code
  2018-03-14 15:49   ` Frederic Weisbecker
  2018-03-14 17:20     ` Peter Zijlstra
@ 2018-03-15 12:33     ` Rafael J. Wysocki
  1 sibling, 0 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2018-03-15 12:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Linux PM, Frederic Weisbecker, Thomas Gleixner,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML

On Wednesday, March 14, 2018 4:49:39 PM CET Frederic Weisbecker wrote:
> On Mon, Mar 12, 2018 at 10:47:41AM +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > Prepare the scheduler tick code for reworking the idle loop to
> > avoid stopping the tick in some cases.
> > 
> > Move away the tick_nohz_start_idle() invocation from
> > __tick_nohz_idle_enter(), rename the latter to
> > __tick_nohz_idle_stop_tick() and define tick_nohz_idle_stop_tick()
> > as a wrapper around it for calling it from the outside.
> > 
> > Make tick_nohz_idle_enter() only call tick_nohz_start_idle() instead
> > of calling the entire __tick_nohz_idle_enter(), add another wrapper
> > disabling and enabling interrupts around tick_nohz_idle_stop_tick()
> > and make the current callers of tick_nohz_idle_enter() call it too
> > to retain their current functionality.
> 
> Perhaps we should have a higher level description of what the patch does.
> After all the low level part is already described in the diff.
> 
> Ie: we are splitting the nohz idle entry call to decouple the idle time
> stats accounting and preparatory work from the actual tick stop code, in
> order to later be able to delay the tick stop once we reach more
> power-knowledgeable callers.

OK


* Re: [RFT][PATCH v4 4/7] cpuidle: Return nohz hint from cpuidle_select()
  2018-03-14 12:59   ` Peter Zijlstra
@ 2018-03-15 12:54     ` Rafael J. Wysocki
  0 siblings, 0 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2018-03-15 12:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux PM, Frederic Weisbecker, Thomas Gleixner, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML

On Wednesday, March 14, 2018 1:59:29 PM CET Peter Zijlstra wrote:
> On Mon, Mar 12, 2018 at 10:54:18AM +0100, Rafael J. Wysocki wrote:
> > @@ -378,6 +384,26 @@ static int menu_select(struct cpuidle_dr
> >  	if (idx == -1)
> >  		idx = 0; /* No states enabled. Must use 0. */
> >  
> > +	if (drv->states[idx].flags & CPUIDLE_FLAG_POLLING) {
> > +		*nohz_ret = false;
> > +	} else if (drv->states[idx].target_residency < TICK_USEC_HZ) {
> > +		first_idx = idx;
> > +		for (i = idx + 1; i < drv->state_count; i++) {
> > +			if (!drv->states[i].disabled &&
> > +			    !dev->states_usage[i].disable) {
> > +				first_idx = i;
> > +				break;
> > +			}
> > +		}
> > +
> > +		/*
> > +		 * Do not stop the tick if there is at least one more state
> > +		 * within the tick period range that could be used if longer
> > +		 * idle duration was predicted.
> > +		 */
> > +		*nohz_ret = !(first_idx > idx &&
> > +			      drv->states[first_idx].target_residency < TICK_USEC_HZ);
> 
> 
> Why though? That comment only states what it does, but gives no clue as
> to why we're doing this. What was wrong with disabling NOHZ if the
> selected state (@idx) has shorter target residency?
> 
> 
> > +	}
> > +
> >  	data->last_state_idx = idx;
> >  
> >  	return data->last_state_idx;
> > 
> 

I invented this complicated logic because of the concern that using
predicted_us alone may skew the prediction algorithm too much towards
the lower values, so the idea was (roughly) to allow CPUs to be idle
for longer times more aggressively.  That is, to stop the tick even
in some cases in which predicted_us was below the tick period length,
but then again not too often.

It appears to work, but you are right that it is confusing and on
second thought simpler code is always better as long as it gets the
job done.

I'll rework this and resend the series later today.
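
For the record, the simpler variant would boil down to something like this
(untested sketch; it assumes *nohz_ret has been initialized to true earlier
in menu_select()):

	if (drv->states[idx].flags & CPUIDLE_FLAG_POLLING ||
	    drv->states[idx].target_residency < TICK_USEC_HZ)
		*nohz_ret = false;	/* too shallow, keep the tick running */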

Thanks!


* Re: [RFT][PATCH v4 2/7] sched: idle: Do not stop the tick upfront in the idle loop
  2018-03-12  9:51 ` [RFT][PATCH v4 2/7] sched: idle: Do not stop the tick upfront in the idle loop Rafael J. Wysocki
@ 2018-03-15 16:10   ` Frederic Weisbecker
  2018-03-15 16:50     ` Rafael J. Wysocki
  0 siblings, 1 reply; 21+ messages in thread
From: Frederic Weisbecker @ 2018-03-15 16:10 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Linux PM, Frederic Weisbecker, Thomas Gleixner,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML

On Mon, Mar 12, 2018 at 10:51:11AM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Push the decision whether or not to stop the tick somewhat deeper
> into the idle loop.
> 
> Stopping the tick upfront leads to unpleasant outcomes in case the
> idle governor doesn't agree with the timekeeping code on the duration
> of the upcoming idle period.

Looks like you meant "nohz" instead of "timekeeping"?


> Specifically, if the tick has been
> stopped and the idle governor predicts short idle, the situation is
> bad regardless of whether or not the prediction is accurate.  If it
> is accurate, the tick has been stopped unnecessarily which means
> excessive overhead.  If it is not accurate, the CPU is likely to
> spend too much time in the (shallow, because short idle has been
> predicted) idle state selected by the governor [1].
> 
> As the first step towards addressing this problem, change the code
> to make the tick stopping decision inside of the loop in do_idle().
> In particular, do not stop the tick in the cpu_idle_poll() code path.
> Also don't do that in tick_nohz_irq_exit() which doesn't really have
> enough information on whether or not to stop the tick.
> 
> Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1]
> Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  kernel/sched/idle.c      |    8 +++++---
>  kernel/time/tick-sched.c |    6 ++----
>  2 files changed, 7 insertions(+), 7 deletions(-)
> 
> Index: linux-pm/kernel/sched/idle.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/idle.c
> +++ linux-pm/kernel/sched/idle.c
> @@ -241,10 +241,12 @@ static void do_idle(void)
>  		 * broadcast device expired for us, we don't want to go deep
>  		 * idle as we know that the IPI is going to arrive right away.
>  		 */
> -		if (cpu_idle_force_poll || tick_check_broadcast_expired())
> +		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
>  			cpu_idle_poll();
> -		else
> +		} else {
> +			tick_nohz_idle_stop_tick();
>  			cpuidle_idle_call();
> +		}

I'm worried about one thing here. Say we enter cpuidle_idle_call() and the tick is stopped.
Later on, we get a tick, so we exit cpuidle_idle_call(), then we find cpu_idle_force_poll
or tick_check_broadcast_expired() to be true. So we poll but the tick hasn't been updated
to fire again.

I don't know if it can happen, but cpu_idle_poll_ctrl() seems to be callable at any time.
It looks like it's only used in __init code or in power suspend/resume; I'm not sure about
the implications of the latter, and there could be further misuse in the future.

Concerning tick_check_broadcast_expired(), it's hard to tell whether it can become
true concurrently, from another CPU or from interrupts.

Anyway perhaps we should have, out of paranoia:

 		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+			tick_nohz_idle_restart_tick();
 			cpu_idle_poll();
 		} else {

...where tick_nohz_idle_restart_tick() would be:

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 29a5733..9ae1ef5 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1046,6 +1046,20 @@ static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 #endif
 }
 
+static void __tick_nohz_idle_restart_tick(struct tick_sched *ts, ktime_t now)
+{
+	tick_nohz_restart_sched_tick(ts, now);
+	tick_nohz_account_idle_ticks(ts);
+}
+
+void tick_nohz_idle_restart_tick(void)
+{
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	if (ts->tick_stopped)
+		__tick_nohz_idle_restart_tick(ts, ktime_get());
+}
+
 /**
  * tick_nohz_idle_exit - restart the idle tick from the idle task
  *
@@ -1070,10 +1084,8 @@ void tick_nohz_idle_exit(void)
 	if (ts->idle_active)
 		tick_nohz_stop_idle(ts, now);
 
-	if (ts->tick_stopped) {
-		tick_nohz_restart_sched_tick(ts, now);
-		tick_nohz_account_idle_ticks(ts);
-	}
+	if (ts->tick_stopped)
+		__tick_nohz_idle_restart_tick(ts, now);
 
 	local_irq_enable();
 }
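
tick_nohz_idle_restart_tick() would also need a declaration in
include/linux/tick.h plus an empty stub for the CONFIG_NO_HZ_COMMON=n case,
mirroring the stop_tick pattern from patch 1, roughly:

#ifdef CONFIG_NO_HZ_COMMON
extern void tick_nohz_idle_restart_tick(void);
#else /* !CONFIG_NO_HZ_COMMON */
static inline void tick_nohz_idle_restart_tick(void) { }
#endif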


Thanks.


* Re: [RFT][PATCH v4 2/7] sched: idle: Do not stop the tick upfront in the idle loop
  2018-03-15 16:10   ` Frederic Weisbecker
@ 2018-03-15 16:50     ` Rafael J. Wysocki
  0 siblings, 0 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2018-03-15 16:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux PM, Frederic Weisbecker,
	Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

On Thu, Mar 15, 2018 at 5:10 PM, Frederic Weisbecker
<frederic@kernel.org> wrote:
> On Mon, Mar 12, 2018 at 10:51:11AM +0100, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Push the decision whether or not to stop the tick somewhat deeper
>> into the idle loop.
>>
>> Stopping the tick upfront leads to unpleasant outcomes in case the
>> idle governor doesn't agree with the timekeeping code on the duration
>> of the upcoming idle period.
>
> Looks like you meant "nohz" instead of "timekeeping"?

Yes, I did.

>> Specifically, if the tick has been
>> stopped and the idle governor predicts short idle, the situation is
>> bad regardless of whether or not the prediction is accurate.  If it
>> is accurate, the tick has been stopped unnecessarily which means
>> excessive overhead.  If it is not accurate, the CPU is likely to
>> spend too much time in the (shallow, because short idle has been
>> predicted) idle state selected by the governor [1].
>>
>> As the first step towards addressing this problem, change the code
>> to make the tick stopping decision inside of the loop in do_idle().
>> In particular, do not stop the tick in the cpu_idle_poll() code path.
>> Also don't do that in tick_nohz_irq_exit() which doesn't really have
>> enough information on whether or not to stop the tick.
>>
>> Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1]
>> Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> ---
>>  kernel/sched/idle.c      |    8 +++++---
>>  kernel/time/tick-sched.c |    6 ++----
>>  2 files changed, 7 insertions(+), 7 deletions(-)
>>
>> Index: linux-pm/kernel/sched/idle.c
>> ===================================================================
>> --- linux-pm.orig/kernel/sched/idle.c
>> +++ linux-pm/kernel/sched/idle.c
>> @@ -241,10 +241,12 @@ static void do_idle(void)
>>                * broadcast device expired for us, we don't want to go deep
>>                * idle as we know that the IPI is going to arrive right away.
>>                */
>> -             if (cpu_idle_force_poll || tick_check_broadcast_expired())
>> +             if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
>>                       cpu_idle_poll();
>> -             else
>> +             } else {
>> +                     tick_nohz_idle_stop_tick();
>>                       cpuidle_idle_call();
>> +             }
>
> I'm worried about one thing here. Say we enter cpuidle_idle_call() and the tick is stopped.
> Later on, we get a tick, so we exit cpuidle_idle_call(), then we find cpu_idle_force_poll
> or tick_check_broadcast_expired() to be true. So we poll but the tick hasn't been updated
> to fire again.
>
> I don't know if it can happen, but cpu_idle_poll_ctrl() seems to be callable at any time.
> It looks like it's only used in __init code or in power suspend/resume; I'm not sure about
> the implications of the latter, and there could be further misuse in the future.
>
> Concerning tick_check_broadcast_expired(), it's hard to tell whether it can become
> true concurrently, from another CPU or from interrupts.
>
> Anyway perhaps we should have, out of paranoia:
>
>  		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> +			tick_nohz_idle_restart_tick();
>  			cpu_idle_poll();
>  		} else {
>

Agreed, I'll update the patch accordingly.
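
The do_idle() fragment would then presumably end up looking like this
(sketch combining the patch with your suggestion; tick_nohz_idle_restart_tick()
checks ts->tick_stopped internally, so it can be called unconditionally):

		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
			tick_nohz_idle_restart_tick();
			cpu_idle_poll();
		} else {
			tick_nohz_idle_stop_tick();
			cpuidle_idle_call();
		}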

Thanks!


* Re: [RFT][PATCH v4 1/7] time: tick-sched: Reorganize idle tick management code
  2018-03-14 17:20     ` Peter Zijlstra
@ 2018-03-15 17:26       ` Frederic Weisbecker
  0 siblings, 0 replies; 21+ messages in thread
From: Frederic Weisbecker @ 2018-03-15 17:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Linux PM, Frederic Weisbecker,
	Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

On Wed, Mar 14, 2018 at 06:20:16PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 14, 2018 at 04:49:39PM +0100, Frederic Weisbecker wrote:
> > On Mon, Mar 12, 2018 at 10:47:41AM +0100, Rafael J. Wysocki wrote:
> 
> > > @@ -136,6 +138,13 @@ static inline u64 get_cpu_idle_time_us(i
> > >  static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
> > >  #endif /* !CONFIG_NO_HZ_COMMON */
> > >  
> > > +static inline void tick_nohz_idle_stop_tick_protected(void)
> > > +{
> > > +	local_irq_disable();
> > > +	tick_nohz_idle_stop_tick();
> > > +	local_irq_enable();
> > > +}
> > 
> > It seems that even if we have CONFIG_NO_HZ_COMMON=n,
> > tick_nohz_idle_stop_tick_protected() will have overhead, right?
> 
> IIRC the only callsite of the _protected thing that remains at the end
> is in the hotplug path. So who cares ;-)

Hey, I see you trying to sneak in overhead out of hatred against the hotplug
code ;-)


* Re: [RFT][PATCH v4 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call()
  2018-03-12  9:53 ` [RFT][PATCH v4 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call() Rafael J. Wysocki
@ 2018-03-15 18:19   ` Frederic Weisbecker
  2018-03-15 20:41     ` Rafael J. Wysocki
  0 siblings, 1 reply; 21+ messages in thread
From: Frederic Weisbecker @ 2018-03-15 18:19 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Linux PM, Frederic Weisbecker, Thomas Gleixner,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML

On Mon, Mar 12, 2018 at 10:53:25AM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Make cpuidle_idle_call() decide whether or not to stop the tick.
> 
> First, the cpuidle_enter_s2idle() path deals with the tick (and with
> the entire timekeeping for that matter) by itself and it doesn't need
> the tick to be stopped beforehand.

Not sure you meant timekeeping either :)

>  	if (idle_should_enter_s2idle() || dev->use_deepest_state) {
>  		if (idle_should_enter_s2idle()) {
> +			rcu_idle_enter();
> +
>  			entered_state = cpuidle_enter_s2idle(drv, dev);
>  			if (entered_state > 0) {
>  				local_irq_enable();
>  				goto exit_idle;
>  			}
> +
> +			rcu_idle_exit();
>  		}

I'm not sure how the tick is stopped on suspend to idle. Perhaps through
hrtimer (tick_cancel_sched_timer()) or clockevents code.

But we may have a problem similar to the one with cpu_idle_poll() called right
after call_cpuidle(). Ie: we arrive in cpuidle_enter_s2idle() with a tick that
should be reprogrammed while it is not. No idea if that can hurt somehow.

I guess it depends on what happens to the tick in s2idle; I'm not clear on that.


* Re: [RFT][PATCH v4 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call()
  2018-03-15 18:19   ` Frederic Weisbecker
@ 2018-03-15 20:41     ` Rafael J. Wysocki
  2018-03-15 21:12       ` Rafael J. Wysocki
  2018-03-16 14:16       ` Frederic Weisbecker
  0 siblings, 2 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2018-03-15 20:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux PM, Frederic Weisbecker,
	Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

On Thu, Mar 15, 2018 at 7:19 PM, Frederic Weisbecker
<frederic@kernel.org> wrote:
> On Mon, Mar 12, 2018 at 10:53:25AM +0100, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Make cpuidle_idle_call() decide whether or not to stop the tick.
>>
>> First, the cpuidle_enter_s2idle() path deals with the tick (and with
>> the entire timekeeping for that matter) by itself and it doesn't need
>> the tick to be stopped beforehand.
>
> Not sure you meant timekeeping either :)

Yeah, I meant nohz.

>>       if (idle_should_enter_s2idle() || dev->use_deepest_state) {
>>               if (idle_should_enter_s2idle()) {
>> +                     rcu_idle_enter();
>> +
>>                       entered_state = cpuidle_enter_s2idle(drv, dev);
>>                       if (entered_state > 0) {
>>                               local_irq_enable();
>>                               goto exit_idle;
>>                       }
>> +
>> +                     rcu_idle_exit();
>>               }
>
> I'm not sure how the tick is stopped on suspend to idle. Perhaps through
> hrtimer (tick_cancel_sched_timer()) or clockevents code.

The latter.

It does clockevents_shutdown() down the road, eventually.

IOW, it couldn't care less. :-)
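
For reference, that path goes roughly like this (abbreviated sketch;
timekeeping_suspend() runs on the last CPU to freeze and does
tick_suspend_local() itself):

	cpuidle_enter_s2idle()
	  enter_s2idle_proper()
	    tick_freeze()
	      tick_suspend_local()	/* or timekeeping_suspend() on the last CPU */
	        clockevents_shutdown()	/* the tick device goes quiet */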

> But we may have a problem similar to the one with cpu_idle_poll() called right
> after call_cpuidle(). Ie: we arrive in cpuidle_enter_s2idle() with a tick that
> should be reprogrammed while it is not. No idea if that can hurt somehow.
>
> I guess it depends on what happens to the tick in s2idle; I'm not clear on that.

No problem there, AFAICS.


* Re: [RFT][PATCH v4 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call()
  2018-03-15 20:41     ` Rafael J. Wysocki
@ 2018-03-15 21:12       ` Rafael J. Wysocki
  2018-03-16 14:17         ` Frederic Weisbecker
  2018-03-16 14:16       ` Frederic Weisbecker
  1 sibling, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2018-03-15 21:12 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux PM, Frederic Weisbecker,
	Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

On Thu, Mar 15, 2018 at 9:41 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Thu, Mar 15, 2018 at 7:19 PM, Frederic Weisbecker
> <frederic@kernel.org> wrote:
>> On Mon, Mar 12, 2018 at 10:53:25AM +0100, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Make cpuidle_idle_call() decide whether or not to stop the tick.
>>>
>>> First, the cpuidle_enter_s2idle() path deals with the tick (and with
>>> the entire timekeeping for that matter) by itself and it doesn't need
>>> the tick to be stopped beforehand.
>>
>> Not sure you meant timekeeping either :)
>
> Yeah, I meant nohz.

Well, not really. :-)

It is the entire timekeeping this time, as it can be suspended
entirely in that code path.


* Re: [RFT][PATCH v4 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call()
  2018-03-15 20:41     ` Rafael J. Wysocki
  2018-03-15 21:12       ` Rafael J. Wysocki
@ 2018-03-16 14:16       ` Frederic Weisbecker
  1 sibling, 0 replies; 21+ messages in thread
From: Frederic Weisbecker @ 2018-03-16 14:16 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux PM, Frederic Weisbecker,
	Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

On Thu, Mar 15, 2018 at 09:41:10PM +0100, Rafael J. Wysocki wrote:
> On Thu, Mar 15, 2018 at 7:19 PM, Frederic Weisbecker
> <frederic@kernel.org> wrote:
> > On Mon, Mar 12, 2018 at 10:53:25AM +0100, Rafael J. Wysocki wrote:
> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> Make cpuidle_idle_call() decide whether or not to stop the tick.
> >>
> >> First, the cpuidle_enter_s2idle() path deals with the tick (and with
> >> the entire timekeeping for that matter) by itself and it doesn't need
> >> the tick to be stopped beforehand.
> >
> > Not sure you meant timekeeping either :)
> 
> Yeah, I meant nohz.
> 
> >>       if (idle_should_enter_s2idle() || dev->use_deepest_state) {
> >>               if (idle_should_enter_s2idle()) {
> >> +                     rcu_idle_enter();
> >> +
> >>                       entered_state = cpuidle_enter_s2idle(drv, dev);
> >>                       if (entered_state > 0) {
> >>                               local_irq_enable();
> >>                               goto exit_idle;
> >>                       }
> >> +
> >> +                     rcu_idle_exit();
> >>               }
> >
> > I'm not sure how the tick is stopped on suspend to idle. Perhaps through
> > hrtimer (tick_cancel_sched_timer()) or clockevents code.
> 
> The latter.
> 
> It does clockevents_shutdown() down the road, eventually.

Ah good. And I see tick_resume_oneshot() takes care of restoring if necessary.

> 
> IOW, it couldn't care less. :-)
> 
> > But we may have a problem similar to the one with cpu_idle_poll() called right
> > after call_cpuidle(). Ie: we arrive in cpuidle_enter_s2idle() with a tick that
> > should be reprogrammed while it is not. No idea if that can hurt somehow.
> >
> > I guess it depends on what happens to the tick in s2idle; I'm not clear on that.
> 
> No problem there, AFAICS.

Yep, all good.

Thanks!


* Re: [RFT][PATCH v4 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call()
  2018-03-15 21:12       ` Rafael J. Wysocki
@ 2018-03-16 14:17         ` Frederic Weisbecker
  0 siblings, 0 replies; 21+ messages in thread
From: Frederic Weisbecker @ 2018-03-16 14:17 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux PM, Frederic Weisbecker,
	Thomas Gleixner, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

On Thu, Mar 15, 2018 at 10:12:57PM +0100, Rafael J. Wysocki wrote:
> On Thu, Mar 15, 2018 at 9:41 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Thu, Mar 15, 2018 at 7:19 PM, Frederic Weisbecker
> > <frederic@kernel.org> wrote:
> >> On Mon, Mar 12, 2018 at 10:53:25AM +0100, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> Make cpuidle_idle_call() decide whether or not to stop the tick.
> >>>
> >>> First, the cpuidle_enter_s2idle() path deals with the tick (and with
> >>> the entire timekeeping for that matter) by itself and it doesn't need
> >>> the tick to be stopped beforehand.
> >>
> >> Not sure you meant timekeeping either :)
> >
> > Yeah, I meant nohz.
> 
> Well, not really. :-)
> 
> It is the entire timekeeping this time, as it can be suspended
> entirely in that code path.

Fair point.

Thanks!


Thread overview: 21+ messages
2018-03-12  9:46 [RFT][PATCH v4 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
2018-03-12  9:47 ` [RFT][PATCH v4 1/7] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
2018-03-14 15:49   ` Frederic Weisbecker
2018-03-14 17:20     ` Peter Zijlstra
2018-03-15 17:26       ` Frederic Weisbecker
2018-03-15 12:33     ` Rafael J. Wysocki
2018-03-12  9:51 ` [RFT][PATCH v4 2/7] sched: idle: Do not stop the tick upfront in the idle loop Rafael J. Wysocki
2018-03-15 16:10   ` Frederic Weisbecker
2018-03-15 16:50     ` Rafael J. Wysocki
2018-03-12  9:53 ` [RFT][PATCH v4 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call() Rafael J. Wysocki
2018-03-15 18:19   ` Frederic Weisbecker
2018-03-15 20:41     ` Rafael J. Wysocki
2018-03-15 21:12       ` Rafael J. Wysocki
2018-03-16 14:17         ` Frederic Weisbecker
2018-03-16 14:16       ` Frederic Weisbecker
2018-03-12  9:54 ` [RFT][PATCH v4 4/7] cpuidle: Return nohz hint from cpuidle_select() Rafael J. Wysocki
2018-03-14 12:59   ` Peter Zijlstra
2018-03-15 12:54     ` Rafael J. Wysocki
2018-03-12 10:04 ` [RFT][PATCH v4 5/7] sched: idle: Select idle state before stopping the tick Rafael J. Wysocki
2018-03-12 10:05 ` [RFT][PATCH v4 6/7] cpuidle: menu: Refine idle state selection for running tick Rafael J. Wysocki
2018-03-12 10:07 ` [RFT][PATCH v4 7/7] cpuidle: menu: Avoid selecting shallow states with stopped tick Rafael J. Wysocki
