linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC/RFT][PATCH 0/7] sched/cpuidle: Idle loop rework
@ 2018-03-04 22:21 Rafael J. Wysocki
  2018-03-04 22:24 ` [RFC/RFT][PATCH 1/7] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-04 22:21 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker
  Cc: Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

Hi All,

The problem is that if we stop the sched tick in
tick_nohz_idle_enter() and then the idle governor predicts short idle
duration, we lose regardless of whether or not it is right.

If it is right, we've lost already, because we stopped the tick
unnecessarily.  If it is not right, we'll lose going forward, because
the idle state selected by the governor is going to be too shallow and
we'll draw too much power (that has been reported recently to actually
happen often enough for people to care).

This patch series is an attempt to improve the situation and the idea
here is to make the decision whether or not to stop the tick deeper in
the idle loop and in particular after running the idle state selection
in the path where the idle governor is invoked.  This way the problem
can be avoided, because the idle duration predicted by the idle governor
can be used to decide whether or not to stop the tick so that the tick
is only stopped if that value is large enough (and, consequently, the
idle state selected by the governor is deep enough).

The series tires to avoid adding too much new code, rather reorder the
existing code and make it more fine-grained.

Patch 1 prepares the tick-sched code for the subsequent modifications and it
doesn't change the code's functionality (at least not intentionally).

Patch 2 starts pushing the tick stopping decision deeper into the idle
loop, but it is limited to do_idle() and tick_nohz_irq_exit().

Patch 3 makes cpuidle_idle_call() decide whether or not to stop the tick
and sets the stage for the changes in patch 6.

Patch 4 splits menu_select() into idle duration prediction and idle state
selection parts.

Patch 5 adds the ->predict callback to struct cpuidle_governor and hooks
it up to the menu governor's idle duration prediction routine introduced
by patch 4.  It also changes cpuidle_select() to return the expected idle
duration in addition to the target state index.

Patch 6 reorders idle duration prediction by the governor and idle state
selection with respect to the stopping of the tick and causes the predicted
idle duration to be used for deciding whether or not to stop the tick.

Patch 7 cleans up the code to avoid running one piece of it twice in a row
in some cases.

I have tested these patches on a couple of machines, including the very laptop
I'm sending them from, without any obvious issues, but please give them a go
if you can, especially if you have an easy way to reproduce the problem they
are targeting.  The patches are on top of 4.16-rc3 (if you need a git branch
with them for easier testing, please let me know).

The above said, this is just RFC, so no pets close to the machines running it,
please, and I'm kind of expecting Peter and Thomas to tear it into pieces. :-)

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC/RFT][PATCH 1/7] time: tick-sched: Reorganize idle tick management code
  2018-03-04 22:21 [RFC/RFT][PATCH 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
@ 2018-03-04 22:24 ` Rafael J. Wysocki
  2018-03-05 10:44   ` Peter Zijlstra
  2018-03-04 22:24 ` [RFC/RFT][PATCH 2/7] sched: idle: Do not stop the tick upfront in the idle loop Rafael J. Wysocki
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-04 22:24 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker
  Cc: Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Prepare two pieces of code in tick_nohz_idle_enter() for being
called separately from each other.

First, make it possible to call the initial preparatory part of
tick_nohz_idle_enter() without the tick-stopping part following
it and introduce the tick_nohz_idle_prepare() wrapper for that
(that will be used in the next set of changes).

Second, add a new stop_tick argument to __tick_nohz_idle_enter()
tell it whether or not to stop the tick (that is always set for
now) and add a wrapper allowing this function to be called from
the outside of tick-sched.c.

Just the code reorganization and two new wrapper functions, no
intended functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 include/linux/tick.h     |    2 +
 kernel/time/tick-sched.c |   51 +++++++++++++++++++++++++++++++----------------
 2 files changed, 36 insertions(+), 17 deletions(-)

Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -114,6 +114,8 @@ enum tick_dep_bits {
 #ifdef CONFIG_NO_HZ_COMMON
 extern bool tick_nohz_enabled;
 extern int tick_nohz_tick_stopped(void);
+extern void tick_nohz_idle_prepare(void);
+extern void tick_nohz_idle_go_idle(bool stop_tick);
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -911,14 +911,14 @@ static bool can_stop_idle_tick(int cpu,
 	return true;
 }
 
-static void __tick_nohz_idle_enter(struct tick_sched *ts)
+static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
 {
 	ktime_t now, expires;
 	int cpu = smp_processor_id();
 
 	now = tick_nohz_start_idle(ts);
 
-	if (can_stop_idle_tick(cpu, ts)) {
+	if (can_stop_idle_tick(cpu, ts) && stop_tick) {
 		int was_stopped = ts->tick_stopped;
 
 		ts->idle_calls++;
@@ -936,19 +936,7 @@ static void __tick_nohz_idle_enter(struc
 	}
 }
 
-/**
- * tick_nohz_idle_enter - stop the idle tick from the idle task
- *
- * When the next event is more than a tick into the future, stop the idle tick
- * Called when we start the idle loop.
- *
- * The arch is responsible of calling:
- *
- * - rcu_idle_enter() after its last use of RCU before the CPU is put
- *  to sleep.
- * - rcu_idle_exit() before the first use of RCU after the CPU is woken up.
- */
-void tick_nohz_idle_enter(void)
+void __tick_nohz_idle_prepare(void)
 {
 	struct tick_sched *ts;
 
@@ -965,7 +953,36 @@ void tick_nohz_idle_enter(void)
 
 	ts = this_cpu_ptr(&tick_cpu_sched);
 	ts->inidle = 1;
-	__tick_nohz_idle_enter(ts);
+}
+
+/**
+ * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
+ *
+ * Called when we start the idle loop.
+ */
+void tick_nohz_idle_prepare(void)
+{
+	__tick_nohz_idle_prepare();
+
+	local_irq_enable();
+}
+
+/**
+ * tick_nohz_idle_go_idle - start idle period on the current CPU.
+ * @stop_tick: Whether or not to stop the idle tick.
+ *
+ * When @stop_tick is set and the next event is more than a tick into the
+ * future, stop the idle tick.
+ */
+void tick_nohz_idle_go_idle(bool stop_tick)
+{
+	__tick_nohz_idle_enter(this_cpu_ptr(&tick_cpu_sched), stop_tick);
+}
+
+void tick_nohz_idle_enter(void)
+{
+	__tick_nohz_idle_prepare();
+	tick_nohz_idle_go_idle(true);
 
 	local_irq_enable();
 }
@@ -983,7 +1000,7 @@ void tick_nohz_irq_exit(void)
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
 	if (ts->inidle)
-		__tick_nohz_idle_enter(ts);
+		__tick_nohz_idle_enter(ts, true);
 	else
 		tick_nohz_full_update_tick(ts);
 }

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC/RFT][PATCH 2/7] sched: idle: Do not stop the tick upfront in the idle loop
  2018-03-04 22:21 [RFC/RFT][PATCH 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
  2018-03-04 22:24 ` [RFC/RFT][PATCH 1/7] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
@ 2018-03-04 22:24 ` Rafael J. Wysocki
  2018-03-04 22:24 ` [RFC/RFT][PATCH 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call() Rafael J. Wysocki
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-04 22:24 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker
  Cc: Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Push the decision whether or not to stop the tick somewhat deeper
into the idle loop.

Stopping the tick upfront leads to unpleasant outcomes in case the
idle governor doesn't agree with the timekeeping code on the duration
of the upcoming idle period.  Specifically, if the tick has been
stopped and the idle governor predicts short idle, the situation is
bad regardless of whether or not the prediction is accurate.  If it
is accurate, the tick has been stopped unnecessarily which means
excessive overhead.  If it is not accurate, the CPU is likely to
spend too much time in the (shallow, because short idle has been
predicted) idle state selected by the governor [1].

As the first step towards addressing this problem, change the code
to make the tick stopping decision inside of the loop in do_idle().
In particular, do not stop the tick in the cpu_idle_poll() code path.
Also don't do that in tick_nohz_irq_exit() which doesn't really have
information to whether or not to stop the tick.

Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1]
Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 kernel/sched/idle.c      |   13 ++++++++++---
 kernel/time/tick-sched.c |    2 +-
 2 files changed, 11 insertions(+), 4 deletions(-)

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -220,13 +220,17 @@ static void do_idle(void)
 	 */
 
 	__current_set_polling();
-	tick_nohz_idle_enter();
+	tick_nohz_idle_prepare();
 
 	while (!need_resched()) {
 		check_pgt_cache();
 		rmb();
 
 		if (cpu_is_offline(cpu)) {
+			local_irq_disable();
+			tick_nohz_idle_go_idle(true);
+			local_irq_enable();
+
 			cpuhp_report_idle_dead();
 			arch_cpu_idle_dead();
 		}
@@ -240,10 +244,13 @@ static void do_idle(void)
 		 * broadcast device expired for us, we don't want to go deep
 		 * idle as we know that the IPI is going to arrive right away.
 		 */
-		if (cpu_idle_force_poll || tick_check_broadcast_expired())
+		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+			tick_nohz_idle_go_idle(false);
 			cpu_idle_poll();
-		else
+		} else {
+			tick_nohz_idle_go_idle(true);
 			cpuidle_idle_call();
+		}
 		arch_cpu_idle_exit();
 	}
 
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -1000,7 +1000,7 @@ void tick_nohz_irq_exit(void)
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
 	if (ts->inidle)
-		__tick_nohz_idle_enter(ts, true);
+		__tick_nohz_idle_enter(ts, false);
 	else
 		tick_nohz_full_update_tick(ts);
 }

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC/RFT][PATCH 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call()
  2018-03-04 22:21 [RFC/RFT][PATCH 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
  2018-03-04 22:24 ` [RFC/RFT][PATCH 1/7] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
  2018-03-04 22:24 ` [RFC/RFT][PATCH 2/7] sched: idle: Do not stop the tick upfront in the idle loop Rafael J. Wysocki
@ 2018-03-04 22:24 ` Rafael J. Wysocki
  2018-03-04 22:26 ` [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection Rafael J. Wysocki
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-04 22:24 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker
  Cc: Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Make cpuidle_idle_call() decide whether or not to stop the tick.

First, there are code paths in cpuidle_idle_call() that don't need
the tick to be stopped.  In particular, the cpuidle_enter_s2idle()
path deals with the tick (and with the entire timekeeping for that
matter) by itself and it doesn't need the tick to be stopped
beforehand.

Second, to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, it will be
necessary to change the ordering of some governor code with respect
to some code in tick_nohz_idle_go_idle().  For this purpose, call
tick_nohz_idle_go_idle() in the same branch as cpuidle_select().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 kernel/sched/idle.c |   17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -146,13 +146,14 @@ static void cpuidle_idle_call(void)
 	}
 
 	/*
-	 * Tell the RCU framework we are entering an idle section,
-	 * so no more rcu read side critical sections and one more
+	 * The RCU framework needs to be told that we are entering an idle
+	 * section, so no more rcu read side critical sections and one more
 	 * step to the grace period
 	 */
-	rcu_idle_enter();
 
 	if (cpuidle_not_available(drv, dev)) {
+		tick_nohz_idle_go_idle(false);
+		rcu_idle_enter();
 		default_idle_call();
 		goto exit_idle;
 	}
@@ -169,6 +170,9 @@ static void cpuidle_idle_call(void)
 
 	if (idle_should_enter_s2idle() || dev->use_deepest_state) {
 		if (idle_should_enter_s2idle()) {
+			tick_nohz_idle_go_idle(false);
+			rcu_idle_enter();
+
 			entered_state = cpuidle_enter_s2idle(drv, dev);
 			if (entered_state > 0) {
 				local_irq_enable();
@@ -176,9 +180,15 @@ static void cpuidle_idle_call(void)
 			}
 		}
 
+		tick_nohz_idle_go_idle(true);
+		rcu_idle_enter();
+
 		next_state = cpuidle_find_deepest_state(drv, dev);
 		call_cpuidle(drv, dev, next_state);
 	} else {
+		tick_nohz_idle_go_idle(true);
+		rcu_idle_enter();
+
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
 		 */
@@ -248,7 +258,6 @@ static void do_idle(void)
 			tick_nohz_idle_go_idle(false);
 			cpu_idle_poll();
 		} else {
-			tick_nohz_idle_go_idle(true);
 			cpuidle_idle_call();
 		}
 		arch_cpu_idle_exit();

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection
  2018-03-04 22:21 [RFC/RFT][PATCH 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
                   ` (2 preceding siblings ...)
  2018-03-04 22:24 ` [RFC/RFT][PATCH 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call() Rafael J. Wysocki
@ 2018-03-04 22:26 ` Rafael J. Wysocki
  2018-03-05 11:38   ` Peter Zijlstra
  2018-03-04 22:27 ` [RFC/RFT][PATCH 5/7] cpuidle: New governor callback for predicting idle duration Rafael J. Wysocki
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-04 22:26 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker
  Cc: Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

In order to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, prepare the
menu governor code for reordering with respect to the timekeeping
code that stops the tick.

Use the observation that menu_select() can be split into two
functions, one predicting the idle duration and one selecting the
idle state, and rework it accordingly.

Introduce menu_predict() to predict the idle duration and (for now)
call it from menu_select().  The next set of changes will use
menu_predict() as a new governor callback.

This change is not expected to alter the functionality of the code.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/menu.c |   68 +++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 30 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -130,6 +130,7 @@ struct menu_device {
 	unsigned int	correction_factor[BUCKETS];
 	unsigned int	intervals[INTERVALS];
 	int		interval_ptr;
+	int		latency_req;
 };
 
 
@@ -275,36 +276,30 @@ again:
 	goto again;
 }
 
-/**
- * menu_select - selects the next idle state to enter
- * @drv: cpuidle driver containing state data
- * @dev: the CPU
- */
-static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+static void menu_predict(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 {
 	struct menu_device *data = this_cpu_ptr(&menu_devices);
 	struct device *device = get_cpu_device(dev->cpu);
-	int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
-	int i;
-	int first_idx;
-	int idx;
+	int resume_latency = dev_pm_qos_raw_read_value(device);
 	unsigned int interactivity_req;
 	unsigned int expected_interval;
 	unsigned long nr_iowaiters, cpu_load;
-	int resume_latency = dev_pm_qos_raw_read_value(device);
 
 	if (data->needs_update) {
 		menu_update(drv, dev);
 		data->needs_update = 0;
 	}
 
-	if (resume_latency < latency_req &&
+	data->latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
+	if (resume_latency < data->latency_req &&
 	    resume_latency != PM_QOS_RESUME_LATENCY_NO_CONSTRAINT)
-		latency_req = resume_latency;
+		data->latency_req = resume_latency;
 
 	/* Special case when user has set very strict latency requirement */
-	if (unlikely(latency_req == 0))
-		return 0;
+	if (unlikely(data->latency_req == 0)) {
+		data->predicted_us = 0;
+		return;
+	}
 
 	/* determine the expected residency time, round up */
 	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
@@ -324,6 +319,32 @@ static int menu_select(struct cpuidle_dr
 	expected_interval = get_typical_interval(data);
 	expected_interval = min(expected_interval, data->next_timer_us);
 
+	/*
+	 * Use the lowest expected idle interval to pick the idle state.
+	 */
+	data->predicted_us = min(data->predicted_us, expected_interval);
+
+	/*
+	 * Use the performance multiplier and the user-configurable
+	 * latency_req to determine the maximum exit latency.
+	 */
+	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
+	if (data->latency_req > interactivity_req)
+		data->latency_req = interactivity_req;
+}
+
+/**
+ * menu_select - selects the next idle state to enter
+ * @drv: cpuidle driver containing state data
+ * @dev: the CPU
+ */
+static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+{
+	struct menu_device *data = this_cpu_ptr(&menu_devices);
+	int first_idx, idx, i;
+
+	menu_predict(drv, dev);
+
 	first_idx = 0;
 	if (drv->states[0].flags & CPUIDLE_FLAG_POLLING) {
 		struct cpuidle_state *s = &drv->states[1];
@@ -336,25 +357,12 @@ static int menu_select(struct cpuidle_dr
 		 */
 		polling_threshold = max_t(unsigned int, 20, s->target_residency);
 		if (data->next_timer_us > polling_threshold &&
-		    latency_req > s->exit_latency && !s->disabled &&
+		    data->latency_req > s->exit_latency && !s->disabled &&
 		    !dev->states_usage[1].disable)
 			first_idx = 1;
 	}
 
 	/*
-	 * Use the lowest expected idle interval to pick the idle state.
-	 */
-	data->predicted_us = min(data->predicted_us, expected_interval);
-
-	/*
-	 * Use the performance multiplier and the user-configurable
-	 * latency_req to determine the maximum exit latency.
-	 */
-	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
-	if (latency_req > interactivity_req)
-		latency_req = interactivity_req;
-
-	/*
 	 * Find the idle state with the lowest power while satisfying
 	 * our constraints.
 	 */
@@ -369,7 +377,7 @@ static int menu_select(struct cpuidle_dr
 			idx = i; /* first enabled state */
 		if (s->target_residency > data->predicted_us)
 			break;
-		if (s->exit_latency > latency_req)
+		if (s->exit_latency > data->latency_req)
 			break;
 
 		idx = i;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC/RFT][PATCH 5/7] cpuidle: New governor callback for predicting idle duration
  2018-03-04 22:21 [RFC/RFT][PATCH 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
                   ` (3 preceding siblings ...)
  2018-03-04 22:26 ` [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection Rafael J. Wysocki
@ 2018-03-04 22:27 ` Rafael J. Wysocki
  2018-03-04 22:28 ` [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick Rafael J. Wysocki
  2018-03-04 22:29 ` [RFC/RFT][PATCH 7/7] time: tick-sched: Avoid running the same code twice in a row Rafael J. Wysocki
  6 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-04 22:27 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker
  Cc: Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

In order to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, introduce a
new cpuidle governor callback for predicting idle duration and make
cpuidle_select() use it to obtain an idle duration estimate.  Also
make cpuidle_select() return the expected idle duration to its caller
through an additional argument pointer.

For the menu governor, make the new callback pointer point to the
menu_predict() routine introduced previously and stop calling it
directly from menu_select().

This change is not expected to alter the functionality of the code.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/cpuidle.c        |   37 +++++++++++++++++++++++++++++++++++--
 drivers/cpuidle/governors/menu.c |   10 ++++++----
 include/linux/cpuidle.h          |    5 ++++-
 kernel/sched/idle.c              |    4 +++-
 4 files changed, 48 insertions(+), 8 deletions(-)

Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -131,7 +131,8 @@ extern bool cpuidle_not_available(struct
 				  struct cpuidle_device *dev);
 
 extern int cpuidle_select(struct cpuidle_driver *drv,
-			  struct cpuidle_device *dev);
+			  struct cpuidle_device *dev,
+			  unsigned int *duration_us_ptr);
 extern int cpuidle_enter(struct cpuidle_driver *drv,
 			 struct cpuidle_device *dev, int index);
 extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -245,6 +246,8 @@ struct cpuidle_governor {
 	void (*disable)		(struct cpuidle_driver *drv,
 					struct cpuidle_device *dev);
 
+	unsigned int (*predict) (struct cpuidle_driver *drv,
+					struct cpuidle_device *dev);
 	int  (*select)		(struct cpuidle_driver *drv,
 					struct cpuidle_device *dev);
 	void (*reflect)		(struct cpuidle_device *dev, int index);
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -276,7 +276,8 @@ again:
 	goto again;
 }
 
-static void menu_predict(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+static unsigned int menu_predict(struct cpuidle_driver *drv,
+				 struct cpuidle_device *dev)
 {
 	struct menu_device *data = this_cpu_ptr(&menu_devices);
 	struct device *device = get_cpu_device(dev->cpu);
@@ -298,7 +299,7 @@ static void menu_predict(struct cpuidle_
 	/* Special case when user has set very strict latency requirement */
 	if (unlikely(data->latency_req == 0)) {
 		data->predicted_us = 0;
-		return;
+		return 0;
 	}
 
 	/* determine the expected residency time, round up */
@@ -331,6 +332,8 @@ static void menu_predict(struct cpuidle_
 	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
 	if (data->latency_req > interactivity_req)
 		data->latency_req = interactivity_req;
+
+	return data->predicted_us;
 }
 
 /**
@@ -343,8 +346,6 @@ static int menu_select(struct cpuidle_dr
 	struct menu_device *data = this_cpu_ptr(&menu_devices);
 	int first_idx, idx, i;
 
-	menu_predict(drv, dev);
-
 	first_idx = 0;
 	if (drv->states[0].flags & CPUIDLE_FLAG_POLLING) {
 		struct cpuidle_state *s = &drv->states[1];
@@ -505,6 +506,7 @@ static struct cpuidle_governor menu_gove
 	.name =		"menu",
 	.rating =	20,
 	.enable =	menu_enable_device,
+	.predict =	menu_predict,
 	.select =	menu_select,
 	.reflect =	menu_reflect,
 };
Index: linux-pm/drivers/cpuidle/cpuidle.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/cpuidle.c
+++ linux-pm/drivers/cpuidle/cpuidle.c
@@ -263,12 +263,45 @@ int cpuidle_enter_state(struct cpuidle_d
  *
  * @drv: the cpuidle driver
  * @dev: the cpuidle device
+ * @duration_us_ptr: pointer to return the expected duration of idle period
  *
  * Returns the index of the idle state.  The return value must not be negative.
+ *
+ * The memory location pointed to by @duration_us_ptr is written the expected
+ * duration of the upcoming idle period, in microseconds.
  */
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+		   unsigned int *duration_us_ptr)
 {
-	return cpuidle_curr_governor->select(drv, dev);
+	unsigned int duration_us;
+	int ret, i;
+
+	if (!cpuidle_curr_governor->predict) {
+		ret = cpuidle_curr_governor->select(drv, dev);
+		*duration_us_ptr = drv->states[ret].target_residency;
+		return ret;
+	}
+
+	duration_us = cpuidle_curr_governor->predict(drv, dev);
+
+	ret = cpuidle_curr_governor->select(drv, dev);
+
+	/*
+	 * Return the target residency of the selected state as the expected
+	 * idle period duration if there are any states available with target
+	 * residencies greater than the predicted idle period duration (to
+	 * avoid staying in a shallow state for too long).
+	 */
+	for (i = ret + 1; i < drv->state_count; i++)
+		if (!drv->states[i].disabled &&
+		    !dev->states_usage[i].disable &&
+		    drv->states[i].target_residency > duration_us) {
+			duration_us = drv->states[ret].target_residency;
+			break;
+		}
+
+	*duration_us_ptr = duration_us;
+	return ret;
 }
 
 /**
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -186,13 +186,15 @@ static void cpuidle_idle_call(void)
 		next_state = cpuidle_find_deepest_state(drv, dev);
 		call_cpuidle(drv, dev, next_state);
 	} else {
+		unsigned int duration_us;
+
 		tick_nohz_idle_go_idle(true);
 		rcu_idle_enter();
 
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
 		 */
-		next_state = cpuidle_select(drv, dev);
+		next_state = cpuidle_select(drv, dev, &duration_us);
 		entered_state = call_cpuidle(drv, dev, next_state);
 		/*
 		 * Give the governor an opportunity to reflect on the outcome

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-04 22:21 [RFC/RFT][PATCH 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
                   ` (4 preceding siblings ...)
  2018-03-04 22:27 ` [RFC/RFT][PATCH 5/7] cpuidle: New governor callback for predicting idle duration Rafael J. Wysocki
@ 2018-03-04 22:28 ` Rafael J. Wysocki
  2018-03-05 11:45   ` Peter Zijlstra
                     ` (3 more replies)
  2018-03-04 22:29 ` [RFC/RFT][PATCH 7/7] time: tick-sched: Avoid running the same code twice in a row Rafael J. Wysocki
  6 siblings, 4 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-04 22:28 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker
  Cc: Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

In order to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, reorder the
code in cpuidle_idle_call() so that the governor idle state selection
runs before tick_nohz_idle_go_idle() and use the expected idle period
duration returned by cpuidle_select() to tell tick_nohz_idle_go_idle()
whether or not to stop the tick.

This isn't straightforward, because menu_predict() invokes
tick_nohz_get_sleep_length() to get the time to the next timer
event and the number returned by the latter comes from
__tick_nohz_idle_enter().  Fortunately, however, it is possible
to compute that number without actually stopping the tick and with
the help of the existing code.

Namely, notice that tick_nohz_stop_sched_tick() already computes the
next timer event time to reprogram the scheduler tick hrtimer and
that time can be used as a proxy for the actual next timer event
time in the idle duration predicition.

Accordingly, rename the original tick_nohz_stop_sched_tick() to
__tick_nohz_next_event() and add the stop_tick argument indicating
whether or not to stop the tick to it.  If that argument is 'true',
the function will work like the original tick_nohz_stop_sched_tick(),
but otherwise it will just compute the next event time without
stopping the tick.  Next, redefine tick_nohz_stop_sched_tick() as
a wrapper around the new function.

Following that, make tick_nohz_get_sleep_length() call
__tick_nohz_next_event() to compute the next timer event time
and make it use the new last_jiffies_update field in struct
tick_sched to tell __tick_nohz_idle_enter() to skip some code
that has run already.

[After this change the __tick_nohz_next_event() code computing the
 next event time will run twice in a row if the expected idle period
 duration coming from cpuidle_select() is large enough which is sort
 of ugly, but the next set of changes deals with that separately.
 To do that, it uses the value of the last_jiffies_update field in
 struct tick_sched introduced here, among other things.]

Finally, drop the now redundant sleep_length field from struct
tick_sched.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 kernel/sched/idle.c      |    7 ++---
 kernel/time/tick-sched.c |   64 +++++++++++++++++++++++++++++++++--------------
 kernel/time/tick-sched.h |    3 --
 3 files changed, 50 insertions(+), 24 deletions(-)

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -188,13 +188,14 @@ static void cpuidle_idle_call(void)
 	} else {
 		unsigned int duration_us;
 
-		tick_nohz_idle_go_idle(true);
-		rcu_idle_enter();
-
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
 		 */
 		next_state = cpuidle_select(drv, dev, &duration_us);
+
+		tick_nohz_idle_go_idle(duration_us > USEC_PER_SEC / HZ);
+		rcu_idle_enter();
+
 		entered_state = call_cpuidle(drv, dev, next_state);
 		/*
 		 * Give the governor an opportunity to reflect on the outcome
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -655,8 +655,8 @@ static inline bool local_timer_softirq_p
 	return local_softirq_pending() & TIMER_SOFTIRQ;
 }
 
-static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
-					 ktime_t now, int cpu)
+static ktime_t __tick_nohz_next_event(struct tick_sched *ts, int cpu,
+				      bool stop_tick)
 {
 	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
 	u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
@@ -670,6 +670,7 @@ static ktime_t tick_nohz_stop_sched_tick
 		basejiff = jiffies;
 	} while (read_seqretry(&jiffies_lock, seq));
 	ts->last_jiffies = basejiff;
+	ts->last_jiffies_update = basemono;
 
 	/*
 	 * Keep the periodic tick, when RCU, architecture or irq_work
@@ -732,8 +733,10 @@ static ktime_t tick_nohz_stop_sched_tick
 	 */
 	delta = timekeeping_max_deferment();
 	if (cpu == tick_do_timer_cpu) {
-		tick_do_timer_cpu = TICK_DO_TIMER_NONE;
-		ts->do_timer_last = 1;
+		if (stop_tick) {
+			tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+			ts->do_timer_last = 1;
+		}
 	} else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
 		delta = KTIME_MAX;
 		ts->do_timer_last = 0;
@@ -756,6 +759,12 @@ static ktime_t tick_nohz_stop_sched_tick
 	expires = min_t(u64, expires, next_tick);
 	tick = expires;
 
+	if (!stop_tick) {
+		/* Undo the effect of get_next_timer_interrupt(). */
+		timer_clear_idle();
+		goto out;
+	}
+
 	/* Skip reprogram of event if its not changed */
 	if (ts->tick_stopped && (expires == ts->next_tick)) {
 		/* Sanity check: make sure clockevent is actually programmed */
@@ -804,14 +813,14 @@ static ktime_t tick_nohz_stop_sched_tick
 	else
 		tick_program_event(tick, 1);
 out:
-	/*
-	 * Update the estimated sleep length until the next timer
-	 * (not only the tick).
-	 */
-	ts->sleep_length = ktime_sub(dev->next_event, now);
 	return tick;
 }
 
+static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, int cpu)
+{
+	return __tick_nohz_next_event(ts, cpu, true);
+}
+
 static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
 	/* Update jiffies first */
@@ -847,7 +856,7 @@ static void tick_nohz_full_update_tick(s
 		return;
 
 	if (can_stop_full_tick(cpu, ts))
-		tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+		tick_nohz_stop_sched_tick(ts, cpu);
 	else if (ts->tick_stopped)
 		tick_nohz_restart_sched_tick(ts, ktime_get());
 #endif
@@ -873,10 +882,8 @@ static bool can_stop_idle_tick(int cpu,
 		return false;
 	}
 
-	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) {
-		ts->sleep_length = NSEC_PER_SEC / HZ;
+	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
 		return false;
-	}
 
 	if (need_resched())
 		return false;
@@ -913,17 +920,22 @@ static bool can_stop_idle_tick(int cpu,
 
 static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
 {
-	ktime_t now, expires;
 	int cpu = smp_processor_id();
 
-	now = tick_nohz_start_idle(ts);
+	if (!ts->last_jiffies_update) {
+		/* tick_nohz_get_sleep_length() has not run. */
+		tick_nohz_start_idle(ts);
+		if (!can_stop_idle_tick(cpu, ts))
+			return;
+	}
 
-	if (can_stop_idle_tick(cpu, ts) && stop_tick) {
+	if (stop_tick) {
 		int was_stopped = ts->tick_stopped;
+		ktime_t expires;
 
 		ts->idle_calls++;
 
-		expires = tick_nohz_stop_sched_tick(ts, now, cpu);
+		expires = tick_nohz_stop_sched_tick(ts, cpu);
 		if (expires > 0LL) {
 			ts->idle_sleeps++;
 			ts->idle_expires = expires;
@@ -934,6 +946,8 @@ static void __tick_nohz_idle_enter(struc
 			nohz_balance_enter_idle(cpu);
 		}
 	}
+
+	ts->last_jiffies_update = 0;
 }
 
 void __tick_nohz_idle_prepare(void)
@@ -1006,15 +1020,27 @@ void tick_nohz_irq_exit(void)
 }
 
 /**
- * tick_nohz_get_sleep_length - return the length of the current sleep
+ * tick_nohz_get_sleep_length - return the expected length of the current sleep
  *
  * Called from power state control code with interrupts disabled
  */
 ktime_t tick_nohz_get_sleep_length(void)
 {
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+	int cpu = smp_processor_id();
+	ktime_t now, next_event;
 
-	return ts->sleep_length;
+	now = tick_nohz_start_idle(ts);
+
+	if (can_stop_idle_tick(cpu, ts)) {
+		next_event = __tick_nohz_next_event(ts, cpu, false);
+	} else {
+		struct clock_event_device *dev;
+
+		dev = __this_cpu_read(tick_cpu_device.evtdev);
+		next_event = dev->next_event;
+	}
+	return ktime_sub(next_event, now);;
 }
 
 /**
Index: linux-pm/kernel/time/tick-sched.h
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.h
+++ linux-pm/kernel/time/tick-sched.h
@@ -38,7 +38,6 @@ enum tick_nohz_mode {
  * @idle_exittime:	Time when the idle state was left
  * @idle_sleeptime:	Sum of the time slept in idle with sched tick stopped
  * @iowait_sleeptime:	Sum of the time slept in idle with sched tick stopped, with IO outstanding
- * @sleep_length:	Duration of the current idle sleep
  * @do_timer_lst:	CPU was the last one doing do_timer before going idle
  */
 struct tick_sched {
@@ -58,8 +57,8 @@ struct tick_sched {
 	ktime_t				idle_exittime;
 	ktime_t				idle_sleeptime;
 	ktime_t				iowait_sleeptime;
-	ktime_t				sleep_length;
 	unsigned long			last_jiffies;
+	u64				last_jiffies_update;
 	u64				next_timer;
 	ktime_t				idle_expires;
 	int				do_timer_last;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC/RFT][PATCH 7/7] time: tick-sched: Avoid running the same code twice in a row
  2018-03-04 22:21 [RFC/RFT][PATCH 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
                   ` (5 preceding siblings ...)
  2018-03-04 22:28 ` [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick Rafael J. Wysocki
@ 2018-03-04 22:29 ` Rafael J. Wysocki
  6 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-04 22:29 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker
  Cc: Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

To avoid running the same piece of code twice in a row, move the
tick stopping part of __tick_nohz_next_event() into a new function
called __tick_nohz_stop_tick() and invoke them both separately.

Make __tick_nohz_idle_enter() avoid calling __tick_nohz_next_event()
if it has been called already by tick_nohz_get_sleep_length() and
use the new next_idle_tick field in struct tick_sched to pass the
next event time value between tick_nohz_get_sleep_length() and
__tick_nohz_idle_enter().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 kernel/time/tick-sched.c |  130 ++++++++++++++++++++++++++---------------------
 kernel/time/tick-sched.h |    1 
 2 files changed, 73 insertions(+), 58 deletions(-)

Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -655,13 +655,10 @@ static inline bool local_timer_softirq_p
 	return local_softirq_pending() & TIMER_SOFTIRQ;
 }
 
-static ktime_t __tick_nohz_next_event(struct tick_sched *ts, int cpu,
-				      bool stop_tick)
+static ktime_t __tick_nohz_next_event(struct tick_sched *ts, int cpu)
 {
-	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
 	u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
 	unsigned long seq, basejiff;
-	ktime_t	tick;
 
 	/* Read jiffies and the time when jiffies were updated last */
 	do {
@@ -714,34 +711,23 @@ static ktime_t __tick_nohz_next_event(st
 		 * We've not stopped the tick yet, and there's a timer in the
 		 * next period, so no point in stopping it either, bail.
 		 */
-		if (!ts->tick_stopped) {
-			tick = 0;
-			goto out;
-		}
+		if (!ts->tick_stopped)
+			return 0;
 	}
 
 	/*
-	 * If this CPU is the one which updates jiffies, then give up
-	 * the assignment and let it be taken by the CPU which runs
-	 * the tick timer next, which might be this CPU as well. If we
-	 * don't drop this here the jiffies might be stale and
-	 * do_timer() never invoked. Keep track of the fact that it
-	 * was the one which had the do_timer() duty last. If this CPU
-	 * is the one which had the do_timer() duty last, we limit the
-	 * sleep time to the timekeeping max_deferment value.
+	 * If this CPU is the one which had the do_timer() duty last, we limit
+	 * the sleep time to the timekeeping max_deferment value.
 	 * Otherwise we can sleep as long as we want.
 	 */
 	delta = timekeeping_max_deferment();
-	if (cpu == tick_do_timer_cpu) {
-		if (stop_tick) {
-			tick_do_timer_cpu = TICK_DO_TIMER_NONE;
-			ts->do_timer_last = 1;
+	if (cpu != tick_do_timer_cpu) {
+		if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
+			delta = KTIME_MAX;
+			ts->do_timer_last = 0;
+		} else if (!ts->do_timer_last) {
+			delta = KTIME_MAX;
 		}
-	} else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
-		delta = KTIME_MAX;
-		ts->do_timer_last = 0;
-	} else if (!ts->do_timer_last) {
-		delta = KTIME_MAX;
 	}
 
 #ifdef CONFIG_NO_HZ_FULL
@@ -756,24 +742,37 @@ static ktime_t __tick_nohz_next_event(st
 	else
 		expires = KTIME_MAX;
 
-	expires = min_t(u64, expires, next_tick);
-	tick = expires;
+	ts->next_idle_tick = min_t(u64, expires, next_tick);
+	return ts->next_idle_tick;
+}
 
-	if (!stop_tick) {
-		/* Undo the effect of get_next_timer_interrupt(). */
-		timer_clear_idle();
-		goto out;
+static void __tick_nohz_stop_tick(struct tick_sched *ts, int cpu, u64 expires)
+{
+	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
+	ktime_t tick = expires;
+
+	/*
+	 * If this CPU is the one which updates jiffies, then give up
+	 * the assignment and let it be taken by the CPU which runs
+	 * the tick timer next, which might be this CPU as well. If we
+	 * don't drop this here the jiffies might be stale and
+	 * do_timer() never invoked. Keep track of the fact that it
+	 * was the one which had the do_timer() duty last.
+	 */
+	if (cpu == tick_do_timer_cpu) {
+		tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+		ts->do_timer_last = 1;
 	}
 
 	/* Skip reprogram of event if its not changed */
 	if (ts->tick_stopped && (expires == ts->next_tick)) {
 		/* Sanity check: make sure clockevent is actually programmed */
 		if (tick == KTIME_MAX || ts->next_tick == hrtimer_get_expires(&ts->sched_timer))
-			goto out;
+			return;
 
 		WARN_ON_ONCE(1);
 		printk_once("basemono: %llu ts->next_tick: %llu dev->next_event: %llu timer->active: %d timer->expires: %llu\n",
-			    basemono, ts->next_tick, dev->next_event,
+			    ts->last_jiffies_update, ts->next_tick, dev->next_event,
 			    hrtimer_active(&ts->sched_timer), hrtimer_get_expires(&ts->sched_timer));
 	}
 
@@ -803,7 +802,7 @@ static ktime_t __tick_nohz_next_event(st
 	if (unlikely(expires == KTIME_MAX)) {
 		if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
 			hrtimer_cancel(&ts->sched_timer);
-		goto out;
+		return;
 	}
 
 	hrtimer_set_expires(&ts->sched_timer, tick);
@@ -812,14 +811,18 @@ static ktime_t __tick_nohz_next_event(st
 		hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED);
 	else
 		tick_program_event(tick, 1);
-out:
-	return tick;
 }
 
-static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, int cpu)
+#ifdef CONFIG_NO_HZ_FULL
+static void tick_nohz_stop_sched_tick(struct tick_sched *ts, int cpu)
 {
-	return __tick_nohz_next_event(ts, cpu, true);
+	u64 next_tick;
+
+	next_tick = __tick_nohz_next_event(ts, cpu);
+	if (next_tick)
+		__tick_nohz_stop_tick(ts, cpu, next_tick);
 }
+#endif /* CONFIG_NO_HZ_FULL */
 
 static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
@@ -921,32 +924,43 @@ static bool can_stop_idle_tick(int cpu,
 static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
 {
 	int cpu = smp_processor_id();
+	ktime_t expires;
+	int was_stopped;
 
-	if (!ts->last_jiffies_update) {
-		/* tick_nohz_get_sleep_length() has not run. */
+	if (ts->last_jiffies_update) {
+		if (!stop_tick)
+			goto out;
+
+		/*
+		 * tick_nohz_get_sleep_length() has run, so the tick timer
+		 * expiration time has been computed already.
+		 */
+		expires = ts->next_idle_tick;
+	} else {
 		tick_nohz_start_idle(ts);
-		if (!can_stop_idle_tick(cpu, ts))
+		if (!can_stop_idle_tick(cpu, ts) || !stop_tick)
 			return;
+
+		expires = __tick_nohz_next_event(ts, cpu);
 	}
 
-	if (stop_tick) {
-		int was_stopped = ts->tick_stopped;
-		ktime_t expires;
-
-		ts->idle_calls++;
-
-		expires = tick_nohz_stop_sched_tick(ts, cpu);
-		if (expires > 0LL) {
-			ts->idle_sleeps++;
-			ts->idle_expires = expires;
-		}
+	ts->idle_calls++;
 
-		if (!was_stopped && ts->tick_stopped) {
-			ts->idle_jiffies = ts->last_jiffies;
-			nohz_balance_enter_idle(cpu);
-		}
+	was_stopped = ts->tick_stopped;
+
+	if (expires > 0LL) {
+		__tick_nohz_stop_tick(ts, cpu, expires);
+
+		ts->idle_sleeps++;
+		ts->idle_expires = expires;
 	}
 
+	if (!was_stopped && ts->tick_stopped) {
+		ts->idle_jiffies = ts->last_jiffies;
+		nohz_balance_enter_idle(cpu);
+	}
+
+out:
 	ts->last_jiffies_update = 0;
 }
 
@@ -957,7 +971,7 @@ void __tick_nohz_idle_prepare(void)
 	lockdep_assert_irqs_enabled();
 	/*
 	 * Update the idle state in the scheduler domain hierarchy
-	 * when tick_nohz_stop_sched_tick() is called from the idle loop.
+	 * when __tick_nohz_stop_tick() is called from the idle loop.
 	 * State will be updated to busy during the first busy tick after
 	 * exiting idle.
 	 */
@@ -1033,7 +1047,7 @@ ktime_t tick_nohz_get_sleep_length(void)
 	now = tick_nohz_start_idle(ts);
 
 	if (can_stop_idle_tick(cpu, ts)) {
-		next_event = __tick_nohz_next_event(ts, cpu, false);
+		next_event = __tick_nohz_next_event(ts, cpu);
 	} else {
 		struct clock_event_device *dev;
 
Index: linux-pm/kernel/time/tick-sched.h
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.h
+++ linux-pm/kernel/time/tick-sched.h
@@ -59,6 +59,7 @@ struct tick_sched {
 	ktime_t				iowait_sleeptime;
 	unsigned long			last_jiffies;
 	u64				last_jiffies_update;
+	u64				next_idle_tick;
 	u64				next_timer;
 	ktime_t				idle_expires;
 	int				do_timer_last;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 1/7] time: tick-sched: Reorganize idle tick management code
  2018-03-04 22:24 ` [RFC/RFT][PATCH 1/7] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
@ 2018-03-05 10:44   ` Peter Zijlstra
  2018-03-05 11:26     ` Rafael J. Wysocki
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-05 10:44 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML, Linux PM

On Sun, Mar 04, 2018 at 11:24:00PM +0100, Rafael J. Wysocki wrote:
> +/**
> + * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
> + *
> + * Called when we start the idle loop.
> + */
> +void tick_nohz_idle_prepare(void)
> +{
> +	__tick_nohz_idle_prepare();
> +
> +	local_irq_enable();
> +}

I really dislike the asymmetry in IRQ state you introduced here.
__tick_nohz_idle_prepare() disables IRQs. Must we do that?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 1/7] time: tick-sched: Reorganize idle tick management code
  2018-03-05 10:44   ` Peter Zijlstra
@ 2018-03-05 11:26     ` Rafael J. Wysocki
  0 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-05 11:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

On Mon, Mar 5, 2018 at 11:44 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Mar 04, 2018 at 11:24:00PM +0100, Rafael J. Wysocki wrote:
>> +/**
>> + * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
>> + *
>> + * Called when we start the idle loop.
>> + */
>> +void tick_nohz_idle_prepare(void)
>> +{
>> +     __tick_nohz_idle_prepare();
>> +
>> +     local_irq_enable();
>> +}
>
> I really dislike the asymmetry in IRQ state you introduced here.
> __tick_nohz_idle_prepare() disables IRQs. Must we do that?

Not really, but at the cost of a tiny bit of duplicated code.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection
  2018-03-04 22:26 ` [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection Rafael J. Wysocki
@ 2018-03-05 11:38   ` Peter Zijlstra
  2018-03-05 11:47     ` Rafael J. Wysocki
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-05 11:38 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML, Linux PM

On Sun, Mar 04, 2018 at 11:26:24PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> In order to address the issue with short idle duration predictions
> by the idle governor after the tick has been stopped, prepare the
> menu governor code for reordering with respect to the timekeeping
> code that stops the tick.
> 
> Use the observation that menu_select() can be split into two
> functions, one predicting the idle duration and one selecting the
> idle state, and rework it accordingly.

I actually think this is the wrong way around.

We really should be predicting state not duration. Yes the duration
thing is an intermediate value, but I don't think it makes any sense
what so ever to preserve that in the predictor. The end result is the
idle state, we should aim for that.

As per:

  https://lkml.org/lkml/2017/7/18/615

there are definite advantages to _not_ preserving duration information
beyond the state boundaries.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-04 22:28 ` [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick Rafael J. Wysocki
@ 2018-03-05 11:45   ` Peter Zijlstra
  2018-03-05 11:50     ` Rafael J. Wysocki
  2018-03-05 12:35   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-05 11:45 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML, Linux PM

On Sun, Mar 04, 2018 at 11:28:56PM +0100, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/idle.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/idle.c
> +++ linux-pm/kernel/sched/idle.c
> @@ -188,13 +188,14 @@ static void cpuidle_idle_call(void)
>  	} else {
>  		unsigned int duration_us;
>  
> -		tick_nohz_idle_go_idle(true);
> -		rcu_idle_enter();
> -
>  		/*
>  		 * Ask the cpuidle framework to choose a convenient idle state.
>  		 */
>  		next_state = cpuidle_select(drv, dev, &duration_us);
> +
> +		tick_nohz_idle_go_idle(duration_us > USEC_PER_SEC / HZ);
> +		rcu_idle_enter();
> +
>  		entered_state = call_cpuidle(drv, dev, next_state);
>  		/*
>  		 * Give the governor an opportunity to reflect on the outcome

So I think this is entirely wrong, I would much rather see something
like:

	tick_nohz_idle_go_idle(next_state->nohz);

Where the selected state itself has the nohz property or not.

We can always insert an extra state at whatever the right boundary point
is for nohz if it doesn't line up with an existing point.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection
  2018-03-05 11:38   ` Peter Zijlstra
@ 2018-03-05 11:47     ` Rafael J. Wysocki
  2018-03-05 12:50       ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-05 11:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

On Mon, Mar 5, 2018 at 12:38 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Mar 04, 2018 at 11:26:24PM +0100, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> In order to address the issue with short idle duration predictions
>> by the idle governor after the tick has been stopped, prepare the
>> menu governor code for reordering with respect to the timekeeping
>> code that stops the tick.
>>
>> Use the observation that menu_select() can be split into two
>> functions, one predicting the idle duration and one selecting the
>> idle state, and rework it accordingly.
>
> I actually think this is the wrong way around.

In the meantime I concluded that this patch and the next one are
overdesign really.  I have a replacement for the two already. :-)

The only thing I need is the expected idle duration that can be
returned from ->select too.

> We really should be predicting state not duration. Yes the duration
> thing is an intermediate value, but I don't think it makes any sense
> what so ever to preserve that in the predictor. The end result is the
> idle state, we should aim for that.
>
> As per:
>
>   https://lkml.org/lkml/2017/7/18/615
>
> there are definite advantages to _not_ preserving duration information
> beyond the state boundaries.

Well, OK

The reason why I need the predicted idle duration is because the
target residency of the selected state may be below the tick period
duration and if this is the deepest state available, we still want to
stop the tick if the predicted idle duration is long.

IOW, the target residency of the selected state doesn't tell you how
much time you should expect to be idle in general.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-05 11:45   ` Peter Zijlstra
@ 2018-03-05 11:50     ` Rafael J. Wysocki
  2018-03-05 12:07       ` Rafael J. Wysocki
  0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-05 11:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

On Mon, Mar 5, 2018 at 12:45 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Mar 04, 2018 at 11:28:56PM +0100, Rafael J. Wysocki wrote:
>> Index: linux-pm/kernel/sched/idle.c
>> ===================================================================
>> --- linux-pm.orig/kernel/sched/idle.c
>> +++ linux-pm/kernel/sched/idle.c
>> @@ -188,13 +188,14 @@ static void cpuidle_idle_call(void)
>>       } else {
>>               unsigned int duration_us;
>>
>> -             tick_nohz_idle_go_idle(true);
>> -             rcu_idle_enter();
>> -
>>               /*
>>                * Ask the cpuidle framework to choose a convenient idle state.
>>                */
>>               next_state = cpuidle_select(drv, dev, &duration_us);
>> +
>> +             tick_nohz_idle_go_idle(duration_us > USEC_PER_SEC / HZ);
>> +             rcu_idle_enter();
>> +
>>               entered_state = call_cpuidle(drv, dev, next_state);
>>               /*
>>                * Give the governor an opportunity to reflect on the outcome
>
> So I think this is entirely wrong, I would much rather see something
> like:
>
>         tick_nohz_idle_go_idle(next_state->nohz);
>
> Where the selected state itself has the nohz property or not.

Can you elaborate here, I'm not following?

> We can always insert an extra state at whatever the right boundary point
> is for nohz if it doesn't line up with an existing point.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-05 11:50     ` Rafael J. Wysocki
@ 2018-03-05 12:07       ` Rafael J. Wysocki
  2018-03-05 12:42         ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-05 12:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

On Mon, Mar 5, 2018 at 12:50 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Mon, Mar 5, 2018 at 12:45 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Sun, Mar 04, 2018 at 11:28:56PM +0100, Rafael J. Wysocki wrote:
>>> Index: linux-pm/kernel/sched/idle.c
>>> ===================================================================
>>> --- linux-pm.orig/kernel/sched/idle.c
>>> +++ linux-pm/kernel/sched/idle.c
>>> @@ -188,13 +188,14 @@ static void cpuidle_idle_call(void)
>>>       } else {
>>>               unsigned int duration_us;
>>>
>>> -             tick_nohz_idle_go_idle(true);
>>> -             rcu_idle_enter();
>>> -
>>>               /*
>>>                * Ask the cpuidle framework to choose a convenient idle state.
>>>                */
>>>               next_state = cpuidle_select(drv, dev, &duration_us);
>>> +
>>> +             tick_nohz_idle_go_idle(duration_us > USEC_PER_SEC / HZ);
>>> +             rcu_idle_enter();
>>> +
>>>               entered_state = call_cpuidle(drv, dev, next_state);
>>>               /*
>>>                * Give the governor an opportunity to reflect on the outcome
>>
>> So I think this is entirely wrong, I would much rather see something
>> like:
>>
>>         tick_nohz_idle_go_idle(next_state->nohz);
>>
>> Where the selected state itself has the nohz property or not.
>
> Can you elaborate here, I'm not following?
>
>> We can always insert an extra state at whatever the right boundary point
>> is for nohz if it doesn't line up with an existing point.

OK, I guess I know what you mean: to add a state flag meaning "stop
the tick if this state is selected".

That could work, but I see problems, like having to go through all of
the already defined states and deciding what to do with them.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-04 22:28 ` [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick Rafael J. Wysocki
  2018-03-05 11:45   ` Peter Zijlstra
@ 2018-03-05 12:35   ` Peter Zijlstra
  2018-03-05 12:56     ` Rafael J. Wysocki
  2018-03-05 13:19     ` Rik van Riel
  2018-03-05 15:36   ` Thomas Ilsche
  2018-03-05 23:27   ` Rik van Riel
  3 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-05 12:35 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML, Linux PM

On Sun, Mar 04, 2018 at 11:28:56PM +0100, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/idle.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/idle.c
> +++ linux-pm/kernel/sched/idle.c
> @@ -188,13 +188,14 @@ static void cpuidle_idle_call(void)
>  	} else {
>  		unsigned int duration_us;
>  
> -		tick_nohz_idle_go_idle(true);
> -		rcu_idle_enter();
> -
>  		/*
>  		 * Ask the cpuidle framework to choose a convenient idle state.
>  		 */
>  		next_state = cpuidle_select(drv, dev, &duration_us);
> +
> +		tick_nohz_idle_go_idle(duration_us > USEC_PER_SEC / HZ);

(FWIW we have TICK_USEC for this)

> +		rcu_idle_enter();
> +
>  		entered_state = call_cpuidle(drv, dev, next_state);
>  		/*
>  		 * Give the governor an opportunity to reflect on the outcome

Also, I think that at this point you've introduced a problem; by not
disabling the tick unconditionally, we'll have extra wakeups due to the
(now still running) tick, which will bias the estimation, as per
reflect(), downwards.

We should effectively discard tick wakeups when we could have entered
nohz but didn't, accumulating the idle period in reflect and only commit
once we get a !tick wakeup.

Of course, for that to work we need to somehow divine what woke us,
which is going to be tricky.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-05 12:07       ` Rafael J. Wysocki
@ 2018-03-05 12:42         ` Peter Zijlstra
  2018-03-05 13:00           ` Rafael J. Wysocki
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-05 12:42 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

On Mon, Mar 05, 2018 at 01:07:07PM +0100, Rafael J. Wysocki wrote:
> On Mon, Mar 5, 2018 at 12:50 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Mon, Mar 5, 2018 at 12:45 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> >> So I think this is entirely wrong, I would much rather see something
> >> like:
> >>
> >>         tick_nohz_idle_go_idle(next_state->nohz);
> >>
> >> Where the selected state itself has the nohz property or not.
> >
> > Can you elaborate here, I'm not following?
> >
> >> We can always insert an extra state at whatever the right boundary point
> >> is for nohz if it doesn't line up with an existing point.
> 
> OK, I guess I know what you mean: to add a state flag meaning "stop
> the tick if this state is selected".

Yes, that.

> That could work, but I see problems, like having to go through all of
> the already defined states and deciding what to do with them.

Shouldn't be too hard, upon registering a cpuidle driver to the cpuidle
core, the core could go through the provided states and flag all those <
TICK_USEC as not stopping, all those > TICK_USEC as stopping and
splitting the state we'd select for TICK_NSEC sleeps, stopping it for <
and disabling it for >.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection
  2018-03-05 11:47     ` Rafael J. Wysocki
@ 2018-03-05 12:50       ` Peter Zijlstra
  2018-03-05 13:05         ` Rafael J. Wysocki
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-05 12:50 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote:
> On Mon, Mar 5, 2018 at 12:38 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > We really should be predicting state not duration. Yes the duration
> > thing is an intermediate value, but I don't think it makes any sense
> > what so ever to preserve that in the predictor. The end result is the
> > idle state, we should aim for that.
> >
> > As per:
> >
> >   https://lkml.org/lkml/2017/7/18/615
> >
> > there are definite advantages to _not_ preserving duration information
> > beyond the state boundaries.
> 
> Well, OK
> 
> The reason why I need the predicted idle duration is because the
> target residency of the selected state may be below the tick period
> duration and if this is the deepest state available, we still want to
> stop the tick if the predicted idle duration is long.

Right, so in that case we'd split the deepest state and mark the
resulting smaller state as not disabling the tick and the resulting
larger state as disabling the tick.

So suppose your deepest state is < TICK_USEC, then we introduce a copy
of that state, modify the boundary to be TICK_USEC and set the 'disable
tick for this state' thing to true.

> IOW, the target residency of the selected state doesn't tell you how
> much time you should expect to be idle in general.

Right, but I think that measure isn't of primary relevance. What we want
to know is: 'should I stop the tick' and 'what C state do I go to'.

In order to answer those questions we need durations as input, but I
don't think we should preserve durations throughout. The scheme from the
above link reduces to N states in order to deal with arbitrary
distributions, only the actual states -- ie boundaries where our answers
changes -- are relevant, anything inside those boundaries would lead to
the exact same answer anyway.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-05 12:35   ` Peter Zijlstra
@ 2018-03-05 12:56     ` Rafael J. Wysocki
  2018-03-05 13:19     ` Rik van Riel
  1 sibling, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-05 12:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

On Mon, Mar 5, 2018 at 1:35 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Mar 04, 2018 at 11:28:56PM +0100, Rafael J. Wysocki wrote:
>> Index: linux-pm/kernel/sched/idle.c
>> ===================================================================
>> --- linux-pm.orig/kernel/sched/idle.c
>> +++ linux-pm/kernel/sched/idle.c
>> @@ -188,13 +188,14 @@ static void cpuidle_idle_call(void)
>>       } else {
>>               unsigned int duration_us;
>>
>> -             tick_nohz_idle_go_idle(true);
>> -             rcu_idle_enter();
>> -
>>               /*
>>                * Ask the cpuidle framework to choose a convenient idle state.
>>                */
>>               next_state = cpuidle_select(drv, dev, &duration_us);
>> +
>> +             tick_nohz_idle_go_idle(duration_us > USEC_PER_SEC / HZ);
>
> (FWIW we have TICK_USEC for this)
>
>> +             rcu_idle_enter();
>> +
>>               entered_state = call_cpuidle(drv, dev, next_state);
>>               /*
>>                * Give the governor an opportunity to reflect on the outcome
>
> Also, I think that at this point you've introduced a problem; by not
> disabling the tick unconditionally, we'll have extra wakeups due to the
> (now still running) tick, which will bias the estimation, as per
> reflect(), downwards.

Correct, but IMO that is better than what we have now.

Also that's in theory, but in practice it may not affect the
distribution of time between different idle states that much.

> We should effectively discard tick wakeups when we could have entered
> nohz but didn't, accumulating the idle period in reflect and only commit
> once we get a !tick wakeup.
>
> Of course, for that to work we need to somehow divine what woke us,
> which is going to be tricky.

So I will only worry about this when I see numbers that will convince
me to do so. :-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-05 12:42         ` Peter Zijlstra
@ 2018-03-05 13:00           ` Rafael J. Wysocki
  0 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-05 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Thomas Gleixner,
	Frederic Weisbecker, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML, Linux PM

On Mon, Mar 5, 2018 at 1:42 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Mar 05, 2018 at 01:07:07PM +0100, Rafael J. Wysocki wrote:
>> On Mon, Mar 5, 2018 at 12:50 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> > On Mon, Mar 5, 2018 at 12:45 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> >> So I think this is entirely wrong, I would much rather see something
>> >> like:
>> >>
>> >>         tick_nohz_idle_go_idle(next_state->nohz);
>> >>
>> >> Where the selected state itself has the nohz property or not.
>> >
>> > Can you elaborate here, I'm not following?
>> >
>> >> We can always insert an extra state at whatever the right boundary point
>> >> is for nohz if it doesn't line up with an existing point.
>>
>> OK, I guess I know what you mean: to add a state flag meaning "stop
>> the tick if this state is selected".
>
> Yes, that.
>
>> That could work, but I see problems, like having to go through all of
>> the already defined states and deciding what to do with them.
>
> Shouldn't be too hard, upon registering a cpuidle driver to the cpuidle
> core, the core could go through the provided states and flag all those <
> TICK_USEC as not stopping, all those > TICK_USEC as stopping and
> splitting the state we'd select for TICK_NSEC sleeps, stopping it for <
> and disabling it for >.
>

Well, on Intel everything below C8 has target residencies below 1 ms. :-)

I think that we want C6 to be "nohz" too, though, at least in some cases.

And what about C3 if C6 is disabled?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection
  2018-03-05 12:50       ` Peter Zijlstra
@ 2018-03-05 13:05         ` Rafael J. Wysocki
  2018-03-05 13:53           ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-05 13:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Thomas Gleixner,
	Frederic Weisbecker, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML, Linux PM

On Mon, Mar 5, 2018 at 1:50 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote:
>> On Mon, Mar 5, 2018 at 12:38 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > We really should be predicting state not duration. Yes the duration
>> > thing is an intermediate value, but I don't think it makes any sense
>> > what so ever to preserve that in the predictor. The end result is the
>> > idle state, we should aim for that.
>> >
>> > As per:
>> >
>> >   https://lkml.org/lkml/2017/7/18/615
>> >
>> > there are definite advantages to _not_ preserving duration information
>> > beyond the state boundaries.
>>
>> Well, OK
>>
>> The reason why I need the predicted idle duration is because the
>> target residency of the selected state may be below the tick period
>> duration and if this is the deepest state available, we still want to
>> stop the tick if the predicted idle duration is long.
>
> Right, so in that case we'd split the deepest state and mark the
> resulting smaller state as not disabling the tick and the resulting
> larger state as disabling the tick.
>
> So suppose your deepest state is < TICK_USEC, then we introduce a copy
> of that state, modify the boundary to be TICK_USEC and set the 'disable
> tick for this state' thing to true.
>
>> IOW, the target residency of the selected state doesn't tell you how
>> much time you should expect to be idle in general.
>
> Right, but I think that measure isn't of primary relevance. What we want
> to know is: 'should I stop the tick' and 'what C state do I go to'.
>
> In order to answer those questions we need durations as input, but I
> don't think we should preserve durations throughout. The scheme from the
> above link reduces to N states in order to deal with arbitrary
> distributions, only the actual states -- ie boundaries where our answers
> changes -- are relevant, anything inside those boundaries would lead to
> the exact same answer anyway.

I generally agree here, but I'm not convinced about flagging the
states, splitting them and so on.

Maybe just return a "nohz" indicator from cpuidle_select() in addition
to the state index and make the decision in the governor?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-05 12:35   ` Peter Zijlstra
  2018-03-05 12:56     ` Rafael J. Wysocki
@ 2018-03-05 13:19     ` Rik van Riel
  2018-03-05 13:37       ` Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: Rik van Riel @ 2018-03-05 13:19 UTC (permalink / raw)
  To: Peter Zijlstra, Rafael J. Wysocki
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Aubrey Li, Mike Galbraith, LKML,
	Linux PM

[-- Attachment #1: Type: text/plain, Size: 2170 bytes --]

On Mon, 2018-03-05 at 13:35 +0100, Peter Zijlstra wrote:
> On Sun, Mar 04, 2018 at 11:28:56PM +0100, Rafael J. Wysocki wrote:
> > Index: linux-pm/kernel/sched/idle.c
> > ===================================================================
> > --- linux-pm.orig/kernel/sched/idle.c
> > +++ linux-pm/kernel/sched/idle.c
> > @@ -188,13 +188,14 @@ static void cpuidle_idle_call(void)
> >  	} else {
> >  		unsigned int duration_us;
> >  
> > -		tick_nohz_idle_go_idle(true);
> > -		rcu_idle_enter();
> > -
> >  		/*
> >  		 * Ask the cpuidle framework to choose a
> > convenient idle state.
> >  		 */
> >  		next_state = cpuidle_select(drv, dev,
> > &duration_us);
> > +
> > +		tick_nohz_idle_go_idle(duration_us > USEC_PER_SEC
> > / HZ);
> 
> (FWIW we have TICK_USEC for this)
> 
> > +		rcu_idle_enter();
> > +
> >  		entered_state = call_cpuidle(drv, dev,
> > next_state);
> >  		/*
> >  		 * Give the governor an opportunity to reflect on
> > the outcome
> 
> Also, I think that at this point you've introduced a problem; by not
> disabling the tick unconditionally, we'll have extra wakeups due to
> the
> (now still running) tick, which will bias the estimation, as per
> reflect(), downwards.
> 
> We should effectively discard tick wakeups when we could have entered
> nohz but didn't, accumulating the idle period in reflect and only
> commit
> once we get a !tick wakeup.

How much of a problem would that actually be?

Don't all but the very deepest C-states have
target residencies that are orders of magnitude
smaller than the tick period?

In other words, if our sleeps end up getting
"cut short" to 600us, we will still select C6,
and it will not result in picking C3 by mistake.

This only seems to affect C7 states and deeper.

It may be worth fixing in the long run, but that
would require keeping track of whether anything
non-idle was done in-between two invocations of
do_idle(), and then checking that there.

That would include not just seeing whether there
have been any context switches on the CPU (easy?),
but also whether any non-timer interrupts were run.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-05 13:19     ` Rik van Riel
@ 2018-03-05 13:37       ` Peter Zijlstra
  2018-03-05 13:46         ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-05 13:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Aubrey Li,
	Mike Galbraith, LKML, Linux PM

On Mon, Mar 05, 2018 at 08:19:15AM -0500, Rik van Riel wrote:
> > Also, I think that at this point you've introduced a problem; by not
> > disabling the tick unconditionally, we'll have extra wakeups due to
> > the (now still running) tick, which will bias the estimation, as per
> > reflect(), downwards.
> > 
> > We should effectively discard tick wakeups when we could have
> > entered nohz but didn't, accumulating the idle period in reflect and
> > only commit once we get a !tick wakeup.
> 
> How much of a problem would that actually be?
> 
> Don't all but the very deepest C-states have
> target residencies that are orders of magnitude
> smaller than the tick period?
> 
> In other words, if our sleeps end up getting
> "cut short" to 600us, we will still select C6,
> and it will not result in picking C3 by mistake.
> 
> This only seems to affect C7 states and deeper.

On modern Intel, what about other platforms? This is something that
should work across the board.

> It may be worth fixing in the long run, but that
> would require keeping track of whether anything
> non-idle was done in-between two invocations of
> do_idle(), and then checking that there.
> 
> That would include not just seeing whether there
> have been any context switches on the CPU (easy?),
> but also whether any non-timer interrupts were run.

Right, its the interrupts that are 'interesting' although I suppose we
could magic something in irq_enter().

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-05 13:37       ` Peter Zijlstra
@ 2018-03-05 13:46         ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-05 13:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael J. Wysocki, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Aubrey Li,
	Mike Galbraith, LKML, Linux PM

On Mon, Mar 05, 2018 at 02:37:25PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 05, 2018 at 08:19:15AM -0500, Rik van Riel wrote:

> > > Also, I think that at this point you've introduced a problem; by not
> > > disabling the tick unconditionally, we'll have extra wakeups due to
> > > the (now still running) tick, which will bias the estimation, as per
> > > reflect(), downwards.
> > > 
> > > We should effectively discard tick wakeups when we could have
> > > entered nohz but didn't, accumulating the idle period in reflect and
> > > only commit once we get a !tick wakeup.
> > 
> > How much of a problem would that actually be?
> > 
> > Don't all but the very deepest C-states have
> > target residencies that are orders of magnitude
> > smaller than the tick period?
> > 
> > In other words, if our sleeps end up getting
> > "cut short" to 600us, we will still select C6,
> > and it will not result in picking C3 by mistake.
> > 
> > This only seems to affect C7 states and deeper.
> 
> On modern Intel, what about other platforms? This is something that
> should work across the board.

Look at this for example:

arch/arm64/boot/dts/hisilicon/hi3660.dtsi:                              min-residency-us = <20000>;

That's 20ms right there..

But on average, considering ARM64 defaults to HZ=250, most of them are
<TICK_USEC.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection
  2018-03-05 13:05         ` Rafael J. Wysocki
@ 2018-03-05 13:53           ` Peter Zijlstra
  2018-03-06  2:15             ` Li, Aubrey
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-05 13:53 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML, Linux PM

On Mon, Mar 05, 2018 at 02:05:10PM +0100, Rafael J. Wysocki wrote:
> On Mon, Mar 5, 2018 at 1:50 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote:

> >> IOW, the target residency of the selected state doesn't tell you how
> >> much time you should expect to be idle in general.
> >
> > Right, but I think that measure isn't of primary relevance. What we want
> > to know is: 'should I stop the tick' and 'what C state do I go to'.
> >
> > In order to answer those questions we need durations as input, but I
> > don't think we should preserve durations throughout. The scheme from the
> > above link reduces to N states in order to deal with arbitrary
> > distributions, only the actual states -- ie boundaries where our answers
> > changes -- are relevant, anything inside those boundaries would lead to
> > the exact same answer anyway.
> 
> I generally agree here, but I'm not convinced about flagging the
> states, splitting them and so on.

I think linking them like that makes sense, but I can see room for
discussion...

> Maybe just return a "nohz" indicator from cpuidle_select() in addition
> to the state index and make the decision in the governor?

Much better option than returning a duration :-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-04 22:28 ` [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick Rafael J. Wysocki
  2018-03-05 11:45   ` Peter Zijlstra
  2018-03-05 12:35   ` Peter Zijlstra
@ 2018-03-05 15:36   ` Thomas Ilsche
  2018-03-05 16:50     ` Peter Zijlstra
  2018-03-05 23:27   ` Rik van Riel
  3 siblings, 1 reply; 32+ messages in thread
From: Thomas Ilsche @ 2018-03-05 15:36 UTC (permalink / raw)
  To: Rafael J. Wysocki, Peter Zijlstra
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Doug Smythies, Rik van Riel, Aubrey Li, Mike Galbraith, LKML,
	Linux PM

On 2018-03-04 23:28, Rafael J. Wysocki wrote:
> use the expected idle period
> duration returned by cpuidle_select() to tell tick_nohz_idle_go_idle()
> whether or not to stop the tick.

I assume that at the point of going idle, the actual next scheduling
tick may happen anywhere between now and 1/HZ. If there is a mechanism
that somehow ensures that the next scheduling tick always happens 1/HZ
after going idle, then some of my arguments are invalid.

Ideally, the decision whether to disable the sched tick should
primarily depend on the order of tree upcoming events: the the sched
tick, the next non-sched timer, and the heuristic prediction:

   https://marc.info/?l=linux-pm&m=151384941425947&w=2

If I read the code correctly, there is already logic deep within
__tick_nohz_idle_enter that prevents disabling the sched tick when
it is scheduled to happen after another timer, which is a good primary
condition for not stopping the sched tick. However the newly added
condition prevents stopping the sched tick in more cases where it is
undesirable.
Assume duration_us is slightly less than USEC_PER_SEC / HZ.
and next sched tick will happen in 0.1 * USEC_PER_SEC / HZ
If the prediction was accurate, the cpu will be woken up way too soon
by the not-disabled sched tick.

I fear that might even create positive feedback loops on the
heuristic, which will take into account the sleep durations for
sched tick wakeups in sort of a self fulfilling prophecy:
1) The heuristic predicts to wake up in less than a full sched period,
2) The sched tick is kept enabled
3) The sched tick wakes up the system in less than a full sched period
4) Repeat

Even when sleeping for longer than target_residency of the deepest
sleep state, you can still improve energy consumption by sleeping
longer whenever possible.

On the opposite side - undesirable shallow sleeps - the proposed patch
will basically always keep the tick enabled if there is a higher sleep
state with a target_residency <= 1/HZ. On systems with relatively low
target_residencies, such as the ones that I am primarily
investigating, this should effectively prevent long shallow sleeps.
However, on mobile systems with C10 states > 5 ms the sched tick is
not a suitable fallback timer for preventing these issues. Well, maybe
the timer itself could be used, but with a larger expiry.

So IMHO
- the precise timer and vague heuristic should not be mixed
- decisions should preferably use actual time points rather than the
   generic tick duration and residency time.
- for some cases the sched tick as is may not be sufficient as fallback

Question: Does disabling a timer on a cpu guarantee that this cpu will
wake-up or is there a scenario where a timer is deleted or moved
externally without the cpu having a chance to change it's idle state?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-05 15:36   ` Thomas Ilsche
@ 2018-03-05 16:50     ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-05 16:50 UTC (permalink / raw)
  To: Thomas Ilsche
  Cc: Rafael J. Wysocki, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML, Linux PM

On Mon, Mar 05, 2018 at 04:36:20PM +0100, Thomas Ilsche wrote:
> I fear that might even create positive feedback loops on the
> heuristic, which will take into account the sleep durations for
> sched tick wakeups in sort of a self fulfilling prophecy:
> 1) The heuristic predicts to wake up in less than a full sched period,
> 2) The sched tick is kept enabled
> 3) The sched tick wakes up the system in less than a full sched period
> 4) Repeat

Right, I pointed out this same issue. We should ignore 'ticks' for the
purpose of measuring sleep duration.

That's slightly tricky to actually do though :/

> Question: Does disabling a timer on a cpu guarantee that this cpu will
> wake-up or is there a scenario where a timer is deleted or moved
> externally without the cpu having a chance to change it's idle state?

I think we let them sleep and don't modify any hardware timers,
resulting in them waking up at the predetermined time, not find anything
to do, recompute the next timer and go back to sleep.

Waking them up to reprogram the timer hardware is far more expensive.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-04 22:28 ` [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick Rafael J. Wysocki
                     ` (2 preceding siblings ...)
  2018-03-05 15:36   ` Thomas Ilsche
@ 2018-03-05 23:27   ` Rik van Riel
  2018-03-06  8:18     ` Rafael J. Wysocki
  3 siblings, 1 reply; 32+ messages in thread
From: Rik van Riel @ 2018-03-05 23:27 UTC (permalink / raw)
  To: Rafael J. Wysocki, Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker
  Cc: Paul McKenney, Thomas Ilsche, Doug Smythies, Aubrey Li,
	Mike Galbraith, LKML, Linux PM

[-- Attachment #1: Type: text/plain, Size: 820 bytes --]

On Sun, 2018-03-04 at 23:28 +0100, Rafael J. Wysocki wrote:
> 
> +++ linux-pm/kernel/sched/idle.c
> @@ -188,13 +188,14 @@ static void cpuidle_idle_call(void)
>  	} else {
>  		unsigned int duration_us;
>  
> -		tick_nohz_idle_go_idle(true);
> -		rcu_idle_enter();
> -
>  		/*
>  		 * Ask the cpuidle framework to choose a convenient
> idle state.
>  		 */
>  		next_state = cpuidle_select(drv, dev, &duration_us);
> +
> +		tick_nohz_idle_go_idle(duration_us > USEC_PER_SEC /
> HZ);
> +		rcu_idle_enter();
> +
>  		entered_state = call_cpuidle(drv, dev, next_state);

When the expected idle period is short enough
that the timer is not stopped, does it make
sense to still call rcu_idle_enter?

Should rcu_idle_enter also be conditional on
the expected idle period?

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection
  2018-03-05 13:53           ` Peter Zijlstra
@ 2018-03-06  2:15             ` Li, Aubrey
  2018-03-06  8:45               ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Li, Aubrey @ 2018-03-06  2:15 UTC (permalink / raw)
  To: Peter Zijlstra, Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Mike Galbraith, LKML, Linux PM

On 2018/3/5 21:53, Peter Zijlstra wrote:
> On Mon, Mar 05, 2018 at 02:05:10PM +0100, Rafael J. Wysocki wrote:
>> On Mon, Mar 5, 2018 at 1:50 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote:
> 
>>>> IOW, the target residency of the selected state doesn't tell you how
>>>> much time you should expect to be idle in general.
>>>
>>> Right, but I think that measure isn't of primary relevance. What we want
>>> to know is: 'should I stop the tick' and 'what C state do I go to'.

I understood the benefit of mapping duration to state number, is duration <->
state number mapping a generic solution to all arches? 

Back to the user's concern is, "I'm running a latency sensitive application, and
I want idle switching ASAP". So I think the user may not care about what C state
to go into, that is, even if a deeper state has chance to go, the user striving
for a higher workload score may still not want it?

>>>
>>> In order to answer those questions we need durations as input, but I
>>> don't think we should preserve durations throughout. The scheme from the
>>> above link reduces to N states in order to deal with arbitrary
>>> distributions, only the actual states -- ie boundaries where our answers
>>> changes -- are relevant, anything inside those boundaries would lead to
>>> the exact same answer anyway.
>>
>> I generally agree here, but I'm not convinced about flagging the
>> states, splitting them and so on.
> 
> I think linking them like that makes sense, but I can see room for
> discussion...
> 
>> Maybe just return a "nohz" indicator from cpuidle_select() in addition
>> to the state index and make the decision in the governor?
> 
> Much better option than returning a duration :-)
>
So what does "nohz = disable and state index = deepest" mean? This combination
does not make sense for performance only purpose?

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick
  2018-03-05 23:27   ` Rik van Riel
@ 2018-03-06  8:18     ` Rafael J. Wysocki
  0 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2018-03-06  8:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Aubrey Li,
	Mike Galbraith, LKML, Linux PM

On Tuesday, March 6, 2018 12:27:01 AM CET Rik van Riel wrote:
> On Sun, 2018-03-04 at 23:28 +0100, Rafael J. Wysocki wrote:
> > 
> > +++ linux-pm/kernel/sched/idle.c
> > @@ -188,13 +188,14 @@ static void cpuidle_idle_call(void)
> >  	} else {
> >  		unsigned int duration_us;
> >  
> > -		tick_nohz_idle_go_idle(true);
> > -		rcu_idle_enter();
> > -
> >  		/*
> >  		 * Ask the cpuidle framework to choose a convenient
> > idle state.
> >  		 */
> >  		next_state = cpuidle_select(drv, dev, &duration_us);
> > +
> > +		tick_nohz_idle_go_idle(duration_us > USEC_PER_SEC /
> > HZ);
> > +		rcu_idle_enter();
> > +
> >  		entered_state = call_cpuidle(drv, dev, next_state);
> 
> When the expected idle period is short enough
> that the timer is not stopped, does it make
> sense to still call rcu_idle_enter?
> 
> Should rcu_idle_enter also be conditional on
> the expected idle period?

Well, that would be the next step. :-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection
  2018-03-06  2:15             ` Li, Aubrey
@ 2018-03-06  8:45               ` Peter Zijlstra
  2018-03-06 14:07                 ` Li, Aubrey
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-03-06  8:45 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Thomas Gleixner,
	Frederic Weisbecker, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Mike Galbraith, LKML, Linux PM

On Tue, Mar 06, 2018 at 10:15:10AM +0800, Li, Aubrey wrote:
> On 2018/3/5 21:53, Peter Zijlstra wrote:
> > On Mon, Mar 05, 2018 at 02:05:10PM +0100, Rafael J. Wysocki wrote:
> >> On Mon, Mar 5, 2018 at 1:50 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >>> On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote:
> > 
> >>>> IOW, the target residency of the selected state doesn't tell you how
> >>>> much time you should expect to be idle in general.
> >>>
> >>> Right, but I think that measure isn't of primary relevance. What we want
> >>> to know is: 'should I stop the tick' and 'what C state do I go to'.
> 
> I understood the benefit of mapping duration to state number, is duration <->
> state number mapping a generic solution to all arches?

Yes, all platforms have a limited set of possible idle states.

> Back to the user's concern is, "I'm running a latency sensitive application, and
> I want idle switching ASAP". So I think the user may not care about what C state
> to go into, that is, even if a deeper state has chance to go, the user striving
> for a higher workload score may still not want it?

The user caring about performance very much cares about the actual idle
state too, exit latency for deeper states is horrific and will screw
them up just as much as the whole nohz timer reprogramming does.

We can basically view the whole nohz thing as an additional entry/exit
latency for the idle state, which is why I don't think its weird to
couple them.

> >> Maybe just return a "nohz" indicator from cpuidle_select() in addition
> >> to the state index and make the decision in the governor?
> > 
> > Much better option than returning a duration :-)
> >
> So what does "nohz = disable and state index = deepest" mean? This combination
> does not make sense for performance only purpose?

I tend to agree with you that the state space allowed by a separate
variable is larger than required, but it's significantly smaller than
preserving 'time' so I can live with it.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection
  2018-03-06  8:45               ` Peter Zijlstra
@ 2018-03-06 14:07                 ` Li, Aubrey
  0 siblings, 0 replies; 32+ messages in thread
From: Li, Aubrey @ 2018-03-06 14:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Thomas Gleixner,
	Frederic Weisbecker, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Mike Galbraith, LKML, Linux PM

On 2018/3/6 16:45, Peter Zijlstra wrote:
> On Tue, Mar 06, 2018 at 10:15:10AM +0800, Li, Aubrey wrote:
>> On 2018/3/5 21:53, Peter Zijlstra wrote:
>>> On Mon, Mar 05, 2018 at 02:05:10PM +0100, Rafael J. Wysocki wrote:
>>>> On Mon, Mar 5, 2018 at 1:50 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote:
>>>
>>>>>> IOW, the target residency of the selected state doesn't tell you how
>>>>>> much time you should expect to be idle in general.
>>>>>
>>>>> Right, but I think that measure isn't of primary relevance. What we want
>>>>> to know is: 'should I stop the tick' and 'what C state do I go to'.
>>
>> I understood the benefit of mapping duration to state number, is duration <->
>> state number mapping a generic solution to all arches?
> 
> Yes, all platforms have a limited set of possible idle states.
> 
>> Back to the user's concern is, "I'm running a latency sensitive application, and
>> I want idle switching ASAP". So I think the user may not care about what C state
>> to go into, that is, even if a deeper state has chance to go, the user striving
>> for a higher workload score may still not want it?
> 
> The user caring about performance very much cares about the actual idle
> state too, exit latency for deeper states is horrific and will screw
> them up just as much as the whole nohz timer reprogramming does.
> 
> We can basically view the whole nohz thing as an additional entry/exit
> latency for the idle state, which is why I don't think its weird to
> couple them.

okay, I see your point now. Thanks!

> 
>>>> Maybe just return a "nohz" indicator from cpuidle_select() in addition
>>>> to the state index and make the decision in the governor?
>>>
>>> Much better option than returning a duration :-)
>>>
>> So what does "nohz = disable and state index = deepest" mean? This combination
>> does not make sense for performance only purpose?
> 
> I tend to agree with you that the state space allowed by a separate
> variable is larger than required, but it's significantly smaller than
> preserving 'time' so I can live with it.
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-03-06 14:07 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-04 22:21 [RFC/RFT][PATCH 0/7] sched/cpuidle: Idle loop rework Rafael J. Wysocki
2018-03-04 22:24 ` [RFC/RFT][PATCH 1/7] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
2018-03-05 10:44   ` Peter Zijlstra
2018-03-05 11:26     ` Rafael J. Wysocki
2018-03-04 22:24 ` [RFC/RFT][PATCH 2/7] sched: idle: Do not stop the tick upfront in the idle loop Rafael J. Wysocki
2018-03-04 22:24 ` [RFC/RFT][PATCH 3/7] sched: idle: Do not stop the tick before cpuidle_idle_call() Rafael J. Wysocki
2018-03-04 22:26 ` [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection Rafael J. Wysocki
2018-03-05 11:38   ` Peter Zijlstra
2018-03-05 11:47     ` Rafael J. Wysocki
2018-03-05 12:50       ` Peter Zijlstra
2018-03-05 13:05         ` Rafael J. Wysocki
2018-03-05 13:53           ` Peter Zijlstra
2018-03-06  2:15             ` Li, Aubrey
2018-03-06  8:45               ` Peter Zijlstra
2018-03-06 14:07                 ` Li, Aubrey
2018-03-04 22:27 ` [RFC/RFT][PATCH 5/7] cpuidle: New governor callback for predicting idle duration Rafael J. Wysocki
2018-03-04 22:28 ` [RFC/RFT][PATCH 6/7] sched: idle: Predict idle duration before stopping the tick Rafael J. Wysocki
2018-03-05 11:45   ` Peter Zijlstra
2018-03-05 11:50     ` Rafael J. Wysocki
2018-03-05 12:07       ` Rafael J. Wysocki
2018-03-05 12:42         ` Peter Zijlstra
2018-03-05 13:00           ` Rafael J. Wysocki
2018-03-05 12:35   ` Peter Zijlstra
2018-03-05 12:56     ` Rafael J. Wysocki
2018-03-05 13:19     ` Rik van Riel
2018-03-05 13:37       ` Peter Zijlstra
2018-03-05 13:46         ` Peter Zijlstra
2018-03-05 15:36   ` Thomas Ilsche
2018-03-05 16:50     ` Peter Zijlstra
2018-03-05 23:27   ` Rik van Riel
2018-03-06  8:18     ` Rafael J. Wysocki
2018-03-04 22:29 ` [RFC/RFT][PATCH 7/7] time: tick-sched: Avoid running the same code twice in a row Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).