All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework
@ 2018-03-06  8:57 Rafael J. Wysocki
  2018-03-06  9:02 ` [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-06  8:57 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML

Hi All,

Thanks a lot for the discussion so far!

Here's a new version of the series addressing some comments from the
discussion and (most importantly) replacing patches 4 and 5 with another
(simpler) patch.

The summary below still applies:

On Sunday, March 4, 2018 11:21:30 PM CET Rafael J. Wysocki wrote:
> 
> The problem is that if we stop the sched tick in
> tick_nohz_idle_enter() and then the idle governor predicts short idle
> duration, we lose regardless of whether or not it is right.
> 
> If it is right, we've lost already, because we stopped the tick
> unnecessarily.  If it is not right, we'll lose going forward, because
> the idle state selected by the governor is going to be too shallow and
> we'll draw too much power (that has been reported recently to actually
> happen often enough for people to care).
> 
> This patch series is an attempt to improve the situation and the idea
> here is to make the decision whether or not to stop the tick deeper in
> the idle loop and in particular after running the idle state selection
> in the path where the idle governor is invoked.  This way the problem
> can be avoided, because the idle duration predicted by the idle governor
> can be used to decide whether or not to stop the tick so that the tick
> is only stopped if that value is large enough (and, consequently, the
> idle state selected by the governor is deep enough).
> 
> The series tires to avoid adding too much new code, rather reorder the
> existing code and make it more fine-grained.
> 
> Patch 1 prepares the tick-sched code for the subsequent modifications and it
> doesn't change the code's functionality (at least not intentionally).
> 
> Patch 2 starts pushing the tick stopping decision deeper into the idle
> loop, but it is limited to do_idle() and tick_nohz_irq_exit().
> 
> Patch 3 makes cpuidle_idle_call() decide whether or not to stop the tick
> and sets the stage for the changes in patch 6.

Patch 4 adds a bool pointer argument to cpuidle_select() and the ->select
governor callback allowing them to return a "nohz" hint on whether or not to
stop the tick to the caller.

Patch 5 reorders the idle state selection with respect to the stopping of the
tick and causes the additional "nohz" hint from cpuidle_select() to be used
for deciding whether or not to stop the tick.

Patch 6 cleans up the code to avoid running one piece of it twice in a row
in some cases.

And the two paragraphs below still apply:

> I have tested these patches on a couple of machines, including the very laptop
> I'm sending them from, without any obvious issues, but please give them a go
> if you can, especially if you have an easy way to reproduce the problem they
> are targeting.  The patches are on top of 4.16-rc3 (if you need a git branch
> with them for easier testing, please let me know).
> 
> The above said, this is just RFC, so no pets close to the machines running it,
> please, and I'm kind of expecting Peter and Thomas to tear it into pieces. :-)

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code
  2018-03-06  8:57 [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework Rafael J. Wysocki
@ 2018-03-06  9:02 ` Rafael J. Wysocki
  2018-03-07 23:18   ` Frederic Weisbecker
  2018-03-06  9:02 ` [RFC/RFT][PATCH v2 2/6] sched: idle: Do not stop the tick upfront in the idle loop Rafael J. Wysocki
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-06  9:02 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Prepare two pieces of code in tick_nohz_idle_enter() for being
called separately from each other.

First, make it possible to call the initial preparatory part of
tick_nohz_idle_enter() without the tick-stopping part following
it and introduce the tick_nohz_idle_prepare() wrapper for that
(that will be used in the next set of changes).

Second, add a new stop_tick argument to __tick_nohz_idle_enter()
tell it whether or not to stop the tick (that is always set for
now) and add a wrapper allowing this function to be called from
the outside of tick-sched.c.

Just the code reorganization and two new wrapper functions, no
intended functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

-> v2: Eliminate assumetry in enabling/disabling interrupts.

---
 include/linux/tick.h     |    2 +
 kernel/time/tick-sched.c |   62 ++++++++++++++++++++++++++++++++---------------
 2 files changed, 45 insertions(+), 19 deletions(-)

Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -114,6 +114,8 @@ enum tick_dep_bits {
 #ifdef CONFIG_NO_HZ_COMMON
 extern bool tick_nohz_enabled;
 extern int tick_nohz_tick_stopped(void);
+extern void tick_nohz_idle_prepare(void);
+extern void tick_nohz_idle_go_idle(bool stop_tick);
 extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -911,14 +911,14 @@ static bool can_stop_idle_tick(int cpu,
 	return true;
 }
 
-static void __tick_nohz_idle_enter(struct tick_sched *ts)
+static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
 {
 	ktime_t now, expires;
 	int cpu = smp_processor_id();
 
 	now = tick_nohz_start_idle(ts);
 
-	if (can_stop_idle_tick(cpu, ts)) {
+	if (can_stop_idle_tick(cpu, ts) && stop_tick) {
 		int was_stopped = ts->tick_stopped;
 
 		ts->idle_calls++;
@@ -936,22 +936,8 @@ static void __tick_nohz_idle_enter(struc
 	}
 }
 
-/**
- * tick_nohz_idle_enter - stop the idle tick from the idle task
- *
- * When the next event is more than a tick into the future, stop the idle tick
- * Called when we start the idle loop.
- *
- * The arch is responsible of calling:
- *
- * - rcu_idle_enter() after its last use of RCU before the CPU is put
- *  to sleep.
- * - rcu_idle_exit() before the first use of RCU after the CPU is woken up.
- */
-void tick_nohz_idle_enter(void)
+void __tick_nohz_idle_prepare(void)
 {
-	struct tick_sched *ts;
-
 	lockdep_assert_irqs_enabled();
 	/*
 	 * Update the idle state in the scheduler domain hierarchy
@@ -960,12 +946,50 @@ void tick_nohz_idle_enter(void)
 	 * exiting idle.
 	 */
 	set_cpu_sd_state_idle();
+}
+
+/**
+ * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
+ *
+ * Called when we start the idle loop.
+ */
+void tick_nohz_idle_prepare(void)
+{
+	struct tick_sched *ts;
+
+	__tick_nohz_idle_prepare();
+
+	local_irq_disable();
+
+	ts = this_cpu_ptr(&tick_cpu_sched);
+	ts->inidle = 1;
+
+	local_irq_enable();
+}
+
+/**
+ * tick_nohz_idle_go_idle - start idle period on the current CPU.
+ * @stop_tick: Whether or not to stop the idle tick.
+ *
+ * When @stop_tick is set and the next event is more than a tick into the
+ * future, stop the idle tick.
+ */
+void tick_nohz_idle_go_idle(bool stop_tick)
+{
+	__tick_nohz_idle_enter(this_cpu_ptr(&tick_cpu_sched), stop_tick);
+}
+
+void tick_nohz_idle_enter(void)
+{
+	struct tick_sched *ts;
+
+	__tick_nohz_idle_prepare();
 
 	local_irq_disable();
 
 	ts = this_cpu_ptr(&tick_cpu_sched);
 	ts->inidle = 1;
-	__tick_nohz_idle_enter(ts);
+	__tick_nohz_idle_enter(ts, true);
 
 	local_irq_enable();
 }
@@ -983,7 +1007,7 @@ void tick_nohz_irq_exit(void)
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
 	if (ts->inidle)
-		__tick_nohz_idle_enter(ts);
+		__tick_nohz_idle_enter(ts, true);
 	else
 		tick_nohz_full_update_tick(ts);
 }

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC/RFT][PATCH v2 2/6] sched: idle: Do not stop the tick upfront in the idle loop
  2018-03-06  8:57 [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework Rafael J. Wysocki
  2018-03-06  9:02 ` [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
@ 2018-03-06  9:02 ` Rafael J. Wysocki
  2018-03-07 23:39   ` Frederic Weisbecker
  2018-03-06  9:03 ` [RFC/RFT][PATCH v2 3/6] sched: idle: Do not stop the tick before cpuidle_idle_call() Rafael J. Wysocki
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-06  9:02 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Push the decision whether or not to stop the tick somewhat deeper
into the idle loop.

Stopping the tick upfront leads to unpleasant outcomes in case the
idle governor doesn't agree with the timekeeping code on the duration
of the upcoming idle period.  Specifically, if the tick has been
stopped and the idle governor predicts short idle, the situation is
bad regardless of whether or not the prediction is accurate.  If it
is accurate, the tick has been stopped unnecessarily which means
excessive overhead.  If it is not accurate, the CPU is likely to
spend too much time in the (shallow, because short idle has been
predicted) idle state selected by the governor [1].

As the first step towards addressing this problem, change the code
to make the tick stopping decision inside of the loop in do_idle().
In particular, do not stop the tick in the cpu_idle_poll() code path.
Also don't do that in tick_nohz_irq_exit() which doesn't really have
information to whether or not to stop the tick.

Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1]
Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

-> v2: No changes.

---
 kernel/sched/idle.c      |   13 ++++++++++---
 kernel/time/tick-sched.c |    2 +-
 2 files changed, 11 insertions(+), 4 deletions(-)

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -220,13 +220,17 @@ static void do_idle(void)
 	 */
 
 	__current_set_polling();
-	tick_nohz_idle_enter();
+	tick_nohz_idle_prepare();
 
 	while (!need_resched()) {
 		check_pgt_cache();
 		rmb();
 
 		if (cpu_is_offline(cpu)) {
+			local_irq_disable();
+			tick_nohz_idle_go_idle(true);
+			local_irq_enable();
+
 			cpuhp_report_idle_dead();
 			arch_cpu_idle_dead();
 		}
@@ -240,10 +244,13 @@ static void do_idle(void)
 		 * broadcast device expired for us, we don't want to go deep
 		 * idle as we know that the IPI is going to arrive right away.
 		 */
-		if (cpu_idle_force_poll || tick_check_broadcast_expired())
+		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+			tick_nohz_idle_go_idle(false);
 			cpu_idle_poll();
-		else
+		} else {
+			tick_nohz_idle_go_idle(true);
 			cpuidle_idle_call();
+		}
 		arch_cpu_idle_exit();
 	}
 
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -1007,7 +1007,7 @@ void tick_nohz_irq_exit(void)
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
 	if (ts->inidle)
-		__tick_nohz_idle_enter(ts, true);
+		__tick_nohz_idle_enter(ts, false);
 	else
 		tick_nohz_full_update_tick(ts);
 }

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC/RFT][PATCH v2 3/6] sched: idle: Do not stop the tick before cpuidle_idle_call()
  2018-03-06  8:57 [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework Rafael J. Wysocki
  2018-03-06  9:02 ` [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
  2018-03-06  9:02 ` [RFC/RFT][PATCH v2 2/6] sched: idle: Do not stop the tick upfront in the idle loop Rafael J. Wysocki
@ 2018-03-06  9:03 ` Rafael J. Wysocki
  2018-03-06  9:05 ` [RFC/RFT][PATCH v2 4/6] cpuidle: Return nohz hint from cpuidle_select() Rafael J. Wysocki
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-06  9:03 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Make cpuidle_idle_call() decide whether or not to stop the tick.

First, there are code paths in cpuidle_idle_call() that don't need
the tick to be stopped.  In particular, the cpuidle_enter_s2idle()
path deals with the tick (and with the entire timekeeping for that
matter) by itself and it doesn't need the tick to be stopped
beforehand.

Second, to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, it will be
necessary to change the ordering of some governor code with respect
to some code in tick_nohz_idle_go_idle().  For this purpose, call
tick_nohz_idle_go_idle() in the same branch as cpuidle_select().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

-> v2: No changes.

---
 kernel/sched/idle.c |   17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -146,13 +146,14 @@ static void cpuidle_idle_call(void)
 	}
 
 	/*
-	 * Tell the RCU framework we are entering an idle section,
-	 * so no more rcu read side critical sections and one more
+	 * The RCU framework needs to be told that we are entering an idle
+	 * section, so no more rcu read side critical sections and one more
 	 * step to the grace period
 	 */
-	rcu_idle_enter();
 
 	if (cpuidle_not_available(drv, dev)) {
+		tick_nohz_idle_go_idle(false);
+		rcu_idle_enter();
 		default_idle_call();
 		goto exit_idle;
 	}
@@ -169,6 +170,9 @@ static void cpuidle_idle_call(void)
 
 	if (idle_should_enter_s2idle() || dev->use_deepest_state) {
 		if (idle_should_enter_s2idle()) {
+			tick_nohz_idle_go_idle(false);
+			rcu_idle_enter();
+
 			entered_state = cpuidle_enter_s2idle(drv, dev);
 			if (entered_state > 0) {
 				local_irq_enable();
@@ -176,9 +180,15 @@ static void cpuidle_idle_call(void)
 			}
 		}
 
+		tick_nohz_idle_go_idle(true);
+		rcu_idle_enter();
+
 		next_state = cpuidle_find_deepest_state(drv, dev);
 		call_cpuidle(drv, dev, next_state);
 	} else {
+		tick_nohz_idle_go_idle(true);
+		rcu_idle_enter();
+
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
 		 */
@@ -248,7 +258,6 @@ static void do_idle(void)
 			tick_nohz_idle_go_idle(false);
 			cpu_idle_poll();
 		} else {
-			tick_nohz_idle_go_idle(true);
 			cpuidle_idle_call();
 		}
 		arch_cpu_idle_exit();

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC/RFT][PATCH v2 4/6] cpuidle: Return nohz hint from cpuidle_select()
  2018-03-06  8:57 [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework Rafael J. Wysocki
                   ` (2 preceding siblings ...)
  2018-03-06  9:03 ` [RFC/RFT][PATCH v2 3/6] sched: idle: Do not stop the tick before cpuidle_idle_call() Rafael J. Wysocki
@ 2018-03-06  9:05 ` Rafael J. Wysocki
  2018-03-06  9:28   ` Rafael J. Wysocki
  2018-03-06 10:06   ` [Update][RFC/RFT][PATCH " Rafael J. Wysocki
  2018-03-06  9:10 ` [RFC/RFT][PATCH v2 5/6] sched: idle: Select idle state before stopping the tick Rafael J. Wysocki
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-06  9:05 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback to allow a boolean value indicating
whether or not the tick should be stopped before entering the
selected state to be returned from there.

Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return 'false" through
it if:
 (1) the idle exit latency is constrained at 0,
 (2) the selected state is a polling one, or
 (3) the target residency of the selected state is less than the
     length of the tick period and there is at least one deeper
     idle state available within the tick period range.

Since the value returned through the new argument pointer is not
used yet, this change is not expected to alter the functionality of
the code.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

-> v2: New patch.

---
 drivers/cpuidle/cpuidle.c          |   10 ++++++++--
 drivers/cpuidle/governors/ladder.c |    3 ++-
 drivers/cpuidle/governors/menu.c   |   26 ++++++++++++++++++++++----
 include/linux/cpuidle.h            |    6 ++++--
 kernel/sched/idle.c                |    4 +++-
 5 files changed, 39 insertions(+), 10 deletions(-)

Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -131,7 +131,8 @@ extern bool cpuidle_not_available(struct
 				  struct cpuidle_device *dev);
 
 extern int cpuidle_select(struct cpuidle_driver *drv,
-			  struct cpuidle_device *dev);
+			  struct cpuidle_device *dev,
+			  bool *nohz_ret);
 extern int cpuidle_enter(struct cpuidle_driver *drv,
 			 struct cpuidle_device *dev, int index);
 extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -246,7 +247,8 @@ struct cpuidle_governor {
 					struct cpuidle_device *dev);
 
 	int  (*select)		(struct cpuidle_driver *drv,
-					struct cpuidle_device *dev);
+					struct cpuidle_device *dev,
+					bool *nohz_ret);
 	void (*reflect)		(struct cpuidle_device *dev, int index);
 };
 
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -186,13 +186,15 @@ static void cpuidle_idle_call(void)
 		next_state = cpuidle_find_deepest_state(drv, dev);
 		call_cpuidle(drv, dev, next_state);
 	} else {
+		bool nohz = true;
+
 		tick_nohz_idle_go_idle(true);
 		rcu_idle_enter();
 
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
 		 */
-		next_state = cpuidle_select(drv, dev);
+		next_state = cpuidle_select(drv, dev, &nohz);
 		entered_state = call_cpuidle(drv, dev, next_state);
 		/*
 		 * Give the governor an opportunity to reflect on the outcome
Index: linux-pm/drivers/cpuidle/cpuidle.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/cpuidle.c
+++ linux-pm/drivers/cpuidle/cpuidle.c
@@ -263,12 +263,18 @@ int cpuidle_enter_state(struct cpuidle_d
  *
  * @drv: the cpuidle driver
  * @dev: the cpuidle device
+ * @nohz_ret: indication on whether or not to stop the tick
  *
  * Returns the index of the idle state.  The return value must not be negative.
+ *
+ * The memory location pointed to by @nohz_ret is expected to be written the
+ * 'false' boolean value if the scheduler tick should not be stopped before
+ * entering the returned state.
  */
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+		   bool *nohz_ret)
 {
-	return cpuidle_curr_governor->select(drv, dev);
+	return cpuidle_curr_governor->select(drv, dev, nohz_ret);
 }
 
 /**
Index: linux-pm/drivers/cpuidle/governors/ladder.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/ladder.c
+++ linux-pm/drivers/cpuidle/governors/ladder.c
@@ -63,9 +63,10 @@ static inline void ladder_do_selection(s
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
+ * @dummy: not used
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-				struct cpuidle_device *dev)
+			       struct cpuidle_device *dev, bool *dummy)
 {
 	struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
 	struct device *device = get_cpu_device(dev->cpu);
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -279,8 +279,10 @@ again:
  * menu_select - selects the next idle state to enter
  * @drv: cpuidle driver containing state data
  * @dev: the CPU
+ * @nohz_ret: indication on whether or not to stop the tick
  */
-static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+		       bool *nohz_ret)
 {
 	struct menu_device *data = this_cpu_ptr(&menu_devices);
 	struct device *device = get_cpu_device(dev->cpu);
@@ -303,8 +305,10 @@ static int menu_select(struct cpuidle_dr
 		latency_req = resume_latency;
 
 	/* Special case when user has set very strict latency requirement */
-	if (unlikely(latency_req == 0))
+	if (unlikely(latency_req == 0)) {
+		*nohz_ret = false;
 		return 0;
+	}
 
 	/* determine the expected residency time, round up */
 	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
@@ -367,10 +371,12 @@ static int menu_select(struct cpuidle_dr
 			continue;
 		if (idx == -1)
 			idx = i; /* first enabled state */
-		if (s->target_residency > data->predicted_us)
-			break;
 		if (s->exit_latency > latency_req)
 			break;
+		if (s->target_residency > data->predicted_us) {
+			first_idx = i;
+			break;
+		}
 
 		idx = i;
 	}
@@ -378,6 +384,18 @@ static int menu_select(struct cpuidle_dr
 	if (idx == -1)
 		idx = 0; /* No states enabled. Must use 0. */
 
+	if (drv->states[idx].flags & CPUIDLE_FLAG_POLLING) {
+		*nohz_ret = false;
+	} else if (drv->states[idx].target_residency < TICK_USEC) {
+		/*
+		 * Do not stop the tick if there is at least one more state
+		 * within the tick period range that could be used if longer
+		 * idle duration was predicted.
+		 */
+		*nohz_ret = first_idx > idx &&
+			    drv->states[first_idx].target_residency < TICK_USEC;
+	}
+
 	data->last_state_idx = idx;
 
 	return data->last_state_idx;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC/RFT][PATCH v2 5/6] sched: idle: Select idle state before stopping the tick
  2018-03-06  8:57 [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework Rafael J. Wysocki
                   ` (3 preceding siblings ...)
  2018-03-06  9:05 ` [RFC/RFT][PATCH v2 4/6] cpuidle: Return nohz hint from cpuidle_select() Rafael J. Wysocki
@ 2018-03-06  9:10 ` Rafael J. Wysocki
  2018-03-06  9:10 ` [RFC/RFT][PATCH v2 6/6] time: tick-sched: Avoid running the same code twice in a row Rafael J. Wysocki
  2018-03-08 10:31 ` [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework Mike Galbraith
  6 siblings, 0 replies; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-06  9:10 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

In order to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, reorder the
code in cpuidle_idle_call() so that the governor idle state selection
runs before tick_nohz_idle_go_idle() and use the "nohz" hint returned
by cpuidle_select() to tell tick_nohz_idle_go_idle() whether or not
to stop the tick.

This isn't straightforward, because menu_predict() invokes
tick_nohz_get_sleep_length() to get the time to the next timer
event and the number returned by the latter comes from
__tick_nohz_idle_enter().  Fortunately, however, it is possible
to compute that number without actually stopping the tick and with
the help of the existing code.

Namely, notice that tick_nohz_stop_sched_tick() already computes the
next timer event time to reprogram the scheduler tick hrtimer and
that time can be used as a proxy for the actual next timer event
time in the idle duration predicition.

Accordingly, rename the original tick_nohz_stop_sched_tick() to
__tick_nohz_next_event() and add the stop_tick argument indicating
whether or not to stop the tick to it.  If that argument is 'true',
the function will work like the original tick_nohz_stop_sched_tick(),
but otherwise it will just compute the next event time without
stopping the tick.  Next, redefine tick_nohz_stop_sched_tick() as
a wrapper around the new function.

Following that, make tick_nohz_get_sleep_length() call
__tick_nohz_next_event() to compute the next timer event time
and make it use the new last_jiffies_update field in struct
tick_sched to tell __tick_nohz_idle_enter() to skip some code
that has run already.

[After this change the __tick_nohz_next_event() code computing the
 next event time will run twice in a row if the expected idle period
 duration coming from cpuidle_select() is large enough which is sort
 of ugly, but the next set of changes deals with that separately.
 To do that, it uses the value of the last_jiffies_update field in
 struct tick_sched introduced here, among other things.]

Finally, drop the now redundant sleep_length field from struct
tick_sched.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

-> v2: Use the "nohz" hint from cpuidle_select() instead of the expected
       idle duration.

---
 kernel/sched/idle.c      |    7 ++---
 kernel/time/tick-sched.c |   64 +++++++++++++++++++++++++++++++++--------------
 kernel/time/tick-sched.h |    3 --
 3 files changed, 50 insertions(+), 24 deletions(-)

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -188,13 +188,14 @@ static void cpuidle_idle_call(void)
 	} else {
 		bool nohz = true;
 
-		tick_nohz_idle_go_idle(true);
-		rcu_idle_enter();
-
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
 		 */
 		next_state = cpuidle_select(drv, dev, &nohz);
+
+		tick_nohz_idle_go_idle(nohz);
+		rcu_idle_enter();
+
 		entered_state = call_cpuidle(drv, dev, next_state);
 		/*
 		 * Give the governor an opportunity to reflect on the outcome
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -655,8 +655,8 @@ static inline bool local_timer_softirq_p
 	return local_softirq_pending() & TIMER_SOFTIRQ;
 }
 
-static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
-					 ktime_t now, int cpu)
+static ktime_t __tick_nohz_next_event(struct tick_sched *ts, int cpu,
+				      bool stop_tick)
 {
 	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
 	u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
@@ -670,6 +670,7 @@ static ktime_t tick_nohz_stop_sched_tick
 		basejiff = jiffies;
 	} while (read_seqretry(&jiffies_lock, seq));
 	ts->last_jiffies = basejiff;
+	ts->last_jiffies_update = basemono;
 
 	/*
 	 * Keep the periodic tick, when RCU, architecture or irq_work
@@ -732,8 +733,10 @@ static ktime_t tick_nohz_stop_sched_tick
 	 */
 	delta = timekeeping_max_deferment();
 	if (cpu == tick_do_timer_cpu) {
-		tick_do_timer_cpu = TICK_DO_TIMER_NONE;
-		ts->do_timer_last = 1;
+		if (stop_tick) {
+			tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+			ts->do_timer_last = 1;
+		}
 	} else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
 		delta = KTIME_MAX;
 		ts->do_timer_last = 0;
@@ -756,6 +759,12 @@ static ktime_t tick_nohz_stop_sched_tick
 	expires = min_t(u64, expires, next_tick);
 	tick = expires;
 
+	if (!stop_tick) {
+		/* Undo the effect of get_next_timer_interrupt(). */
+		timer_clear_idle();
+		goto out;
+	}
+
 	/* Skip reprogram of event if its not changed */
 	if (ts->tick_stopped && (expires == ts->next_tick)) {
 		/* Sanity check: make sure clockevent is actually programmed */
@@ -804,14 +813,14 @@ static ktime_t tick_nohz_stop_sched_tick
 	else
 		tick_program_event(tick, 1);
 out:
-	/*
-	 * Update the estimated sleep length until the next timer
-	 * (not only the tick).
-	 */
-	ts->sleep_length = ktime_sub(dev->next_event, now);
 	return tick;
 }
 
+static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, int cpu)
+{
+	return __tick_nohz_next_event(ts, cpu, true);
+}
+
 static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
 	/* Update jiffies first */
@@ -847,7 +856,7 @@ static void tick_nohz_full_update_tick(s
 		return;
 
 	if (can_stop_full_tick(cpu, ts))
-		tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+		tick_nohz_stop_sched_tick(ts, cpu);
 	else if (ts->tick_stopped)
 		tick_nohz_restart_sched_tick(ts, ktime_get());
 #endif
@@ -873,10 +882,8 @@ static bool can_stop_idle_tick(int cpu,
 		return false;
 	}
 
-	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) {
-		ts->sleep_length = NSEC_PER_SEC / HZ;
+	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
 		return false;
-	}
 
 	if (need_resched())
 		return false;
@@ -913,17 +920,22 @@ static bool can_stop_idle_tick(int cpu,
 
 static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
 {
-	ktime_t now, expires;
 	int cpu = smp_processor_id();
 
-	now = tick_nohz_start_idle(ts);
+	if (!ts->last_jiffies_update) {
+		/* tick_nohz_get_sleep_length() has not run. */
+		tick_nohz_start_idle(ts);
+		if (!can_stop_idle_tick(cpu, ts))
+			return;
+	}
 
-	if (can_stop_idle_tick(cpu, ts) && stop_tick) {
+	if (stop_tick) {
 		int was_stopped = ts->tick_stopped;
+		ktime_t expires;
 
 		ts->idle_calls++;
 
-		expires = tick_nohz_stop_sched_tick(ts, now, cpu);
+		expires = tick_nohz_stop_sched_tick(ts, cpu);
 		if (expires > 0LL) {
 			ts->idle_sleeps++;
 			ts->idle_expires = expires;
@@ -934,6 +946,8 @@ static void __tick_nohz_idle_enter(struc
 			nohz_balance_enter_idle(cpu);
 		}
 	}
+
+	ts->last_jiffies_update = 0;
 }
 
 void __tick_nohz_idle_prepare(void)
@@ -1013,15 +1027,27 @@ void tick_nohz_irq_exit(void)
 }
 
 /**
- * tick_nohz_get_sleep_length - return the length of the current sleep
+ * tick_nohz_get_sleep_length - return the expected length of the current sleep
  *
  * Called from power state control code with interrupts disabled
  */
 ktime_t tick_nohz_get_sleep_length(void)
 {
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+	int cpu = smp_processor_id();
+	ktime_t now, next_event;
 
-	return ts->sleep_length;
+	now = tick_nohz_start_idle(ts);
+
+	if (can_stop_idle_tick(cpu, ts)) {
+		next_event = __tick_nohz_next_event(ts, cpu, false);
+	} else {
+		struct clock_event_device *dev;
+
+		dev = __this_cpu_read(tick_cpu_device.evtdev);
+		next_event = dev->next_event;
+	}
+	return ktime_sub(next_event, now);;
 }
 
 /**
Index: linux-pm/kernel/time/tick-sched.h
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.h
+++ linux-pm/kernel/time/tick-sched.h
@@ -38,7 +38,6 @@ enum tick_nohz_mode {
  * @idle_exittime:	Time when the idle state was left
  * @idle_sleeptime:	Sum of the time slept in idle with sched tick stopped
  * @iowait_sleeptime:	Sum of the time slept in idle with sched tick stopped, with IO outstanding
- * @sleep_length:	Duration of the current idle sleep
  * @do_timer_lst:	CPU was the last one doing do_timer before going idle
  */
 struct tick_sched {
@@ -58,8 +57,8 @@ struct tick_sched {
 	ktime_t				idle_exittime;
 	ktime_t				idle_sleeptime;
 	ktime_t				iowait_sleeptime;
-	ktime_t				sleep_length;
 	unsigned long			last_jiffies;
+	u64				last_jiffies_update;
 	u64				next_timer;
 	ktime_t				idle_expires;
 	int				do_timer_last;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC/RFT][PATCH v2 6/6] time: tick-sched: Avoid running the same code twice in a row
  2018-03-06  8:57 [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework Rafael J. Wysocki
                   ` (4 preceding siblings ...)
  2018-03-06  9:10 ` [RFC/RFT][PATCH v2 5/6] sched: idle: Select idle state before stopping the tick Rafael J. Wysocki
@ 2018-03-06  9:10 ` Rafael J. Wysocki
  2018-03-08 10:31 ` [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework Mike Galbraith
  6 siblings, 0 replies; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-06  9:10 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

To avoid running the same piece of code twice in a row, move the
tick stopping part of __tick_nohz_next_event() into a new function
called __tick_nohz_stop_tick() and invoke them both separately.

Make __tick_nohz_idle_enter() avoid calling __tick_nohz_next_event()
if it has been called already by tick_nohz_get_sleep_length() and
use the new next_idle_tick field in struct tick_sched to pass the
next event time value between tick_nohz_get_sleep_length() and
__tick_nohz_idle_enter().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

-> v2: No changes.

---
 kernel/time/tick-sched.c |  130 ++++++++++++++++++++++++++---------------------
 kernel/time/tick-sched.h |    1 
 2 files changed, 73 insertions(+), 58 deletions(-)

Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -655,13 +655,10 @@ static inline bool local_timer_softirq_p
 	return local_softirq_pending() & TIMER_SOFTIRQ;
 }
 
-static ktime_t __tick_nohz_next_event(struct tick_sched *ts, int cpu,
-				      bool stop_tick)
+static ktime_t __tick_nohz_next_event(struct tick_sched *ts, int cpu)
 {
-	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
 	u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
 	unsigned long seq, basejiff;
-	ktime_t	tick;
 
 	/* Read jiffies and the time when jiffies were updated last */
 	do {
@@ -714,34 +711,23 @@ static ktime_t __tick_nohz_next_event(st
 		 * We've not stopped the tick yet, and there's a timer in the
 		 * next period, so no point in stopping it either, bail.
 		 */
-		if (!ts->tick_stopped) {
-			tick = 0;
-			goto out;
-		}
+		if (!ts->tick_stopped)
+			return 0;
 	}
 
 	/*
-	 * If this CPU is the one which updates jiffies, then give up
-	 * the assignment and let it be taken by the CPU which runs
-	 * the tick timer next, which might be this CPU as well. If we
-	 * don't drop this here the jiffies might be stale and
-	 * do_timer() never invoked. Keep track of the fact that it
-	 * was the one which had the do_timer() duty last. If this CPU
-	 * is the one which had the do_timer() duty last, we limit the
-	 * sleep time to the timekeeping max_deferment value.
+	 * If this CPU is the one which had the do_timer() duty last, we limit
+	 * the sleep time to the timekeeping max_deferment value.
 	 * Otherwise we can sleep as long as we want.
 	 */
 	delta = timekeeping_max_deferment();
-	if (cpu == tick_do_timer_cpu) {
-		if (stop_tick) {
-			tick_do_timer_cpu = TICK_DO_TIMER_NONE;
-			ts->do_timer_last = 1;
+	if (cpu != tick_do_timer_cpu) {
+		if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
+			delta = KTIME_MAX;
+			ts->do_timer_last = 0;
+		} else if (!ts->do_timer_last) {
+			delta = KTIME_MAX;
 		}
-	} else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
-		delta = KTIME_MAX;
-		ts->do_timer_last = 0;
-	} else if (!ts->do_timer_last) {
-		delta = KTIME_MAX;
 	}
 
 #ifdef CONFIG_NO_HZ_FULL
@@ -756,24 +742,37 @@ static ktime_t __tick_nohz_next_event(st
 	else
 		expires = KTIME_MAX;
 
-	expires = min_t(u64, expires, next_tick);
-	tick = expires;
+	ts->next_idle_tick = min_t(u64, expires, next_tick);
+	return ts->next_idle_tick;
+}
 
-	if (!stop_tick) {
-		/* Undo the effect of get_next_timer_interrupt(). */
-		timer_clear_idle();
-		goto out;
+static void __tick_nohz_stop_tick(struct tick_sched *ts, int cpu, u64 expires)
+{
+	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
+	ktime_t tick = expires;
+
+	/*
+	 * If this CPU is the one which updates jiffies, then give up
+	 * the assignment and let it be taken by the CPU which runs
+	 * the tick timer next, which might be this CPU as well. If we
+	 * don't drop this here the jiffies might be stale and
+	 * do_timer() never invoked. Keep track of the fact that it
+	 * was the one which had the do_timer() duty last.
+	 */
+	if (cpu == tick_do_timer_cpu) {
+		tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+		ts->do_timer_last = 1;
 	}
 
 	/* Skip reprogram of event if its not changed */
 	if (ts->tick_stopped && (expires == ts->next_tick)) {
 		/* Sanity check: make sure clockevent is actually programmed */
 		if (tick == KTIME_MAX || ts->next_tick == hrtimer_get_expires(&ts->sched_timer))
-			goto out;
+			return;
 
 		WARN_ON_ONCE(1);
 		printk_once("basemono: %llu ts->next_tick: %llu dev->next_event: %llu timer->active: %d timer->expires: %llu\n",
-			    basemono, ts->next_tick, dev->next_event,
+			    ts->last_jiffies_update, ts->next_tick, dev->next_event,
 			    hrtimer_active(&ts->sched_timer), hrtimer_get_expires(&ts->sched_timer));
 	}
 
@@ -803,7 +802,7 @@ static ktime_t __tick_nohz_next_event(st
 	if (unlikely(expires == KTIME_MAX)) {
 		if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
 			hrtimer_cancel(&ts->sched_timer);
-		goto out;
+		return;
 	}
 
 	hrtimer_set_expires(&ts->sched_timer, tick);
@@ -812,14 +811,18 @@ static ktime_t __tick_nohz_next_event(st
 		hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED);
 	else
 		tick_program_event(tick, 1);
-out:
-	return tick;
 }
 
-static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, int cpu)
+#ifdef CONFIG_NO_HZ_FULL
+static void tick_nohz_stop_sched_tick(struct tick_sched *ts, int cpu)
 {
-	return __tick_nohz_next_event(ts, cpu, true);
+	u64 next_tick;
+
+	next_tick = __tick_nohz_next_event(ts, cpu);
+	if (next_tick)
+		__tick_nohz_stop_tick(ts, cpu, next_tick);
 }
+#endif /* CONFIG_NO_HZ_FULL */
 
 static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
@@ -921,32 +924,43 @@ static bool can_stop_idle_tick(int cpu,
 static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
 {
 	int cpu = smp_processor_id();
+	ktime_t expires;
+	int was_stopped;
 
-	if (!ts->last_jiffies_update) {
-		/* tick_nohz_get_sleep_length() has not run. */
+	if (ts->last_jiffies_update) {
+		if (!stop_tick)
+			goto out;
+
+		/*
+		 * tick_nohz_get_sleep_length() has run, so the tick timer
+		 * expiration time has been computed already.
+		 */
+		expires = ts->next_idle_tick;
+	} else {
 		tick_nohz_start_idle(ts);
-		if (!can_stop_idle_tick(cpu, ts))
+		if (!can_stop_idle_tick(cpu, ts) || !stop_tick)
 			return;
+
+		expires = __tick_nohz_next_event(ts, cpu);
 	}
 
-	if (stop_tick) {
-		int was_stopped = ts->tick_stopped;
-		ktime_t expires;
-
-		ts->idle_calls++;
-
-		expires = tick_nohz_stop_sched_tick(ts, cpu);
-		if (expires > 0LL) {
-			ts->idle_sleeps++;
-			ts->idle_expires = expires;
-		}
+	ts->idle_calls++;
 
-		if (!was_stopped && ts->tick_stopped) {
-			ts->idle_jiffies = ts->last_jiffies;
-			nohz_balance_enter_idle(cpu);
-		}
+	was_stopped = ts->tick_stopped;
+
+	if (expires > 0LL) {
+		__tick_nohz_stop_tick(ts, cpu, expires);
+
+		ts->idle_sleeps++;
+		ts->idle_expires = expires;
 	}
 
+	if (!was_stopped && ts->tick_stopped) {
+		ts->idle_jiffies = ts->last_jiffies;
+		nohz_balance_enter_idle(cpu);
+	}
+
+out:
 	ts->last_jiffies_update = 0;
 }
 
@@ -955,7 +969,7 @@ void __tick_nohz_idle_prepare(void)
 	lockdep_assert_irqs_enabled();
 	/*
 	 * Update the idle state in the scheduler domain hierarchy
-	 * when tick_nohz_stop_sched_tick() is called from the idle loop.
+	 * when __tick_nohz_stop_tick() is called from the idle loop.
 	 * State will be updated to busy during the first busy tick after
 	 * exiting idle.
 	 */
@@ -1040,7 +1054,7 @@ ktime_t tick_nohz_get_sleep_length(void)
 	now = tick_nohz_start_idle(ts);
 
 	if (can_stop_idle_tick(cpu, ts)) {
-		next_event = __tick_nohz_next_event(ts, cpu, false);
+		next_event = __tick_nohz_next_event(ts, cpu);
 	} else {
 		struct clock_event_device *dev;
 
Index: linux-pm/kernel/time/tick-sched.h
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.h
+++ linux-pm/kernel/time/tick-sched.h
@@ -59,6 +59,7 @@ struct tick_sched {
 	ktime_t				iowait_sleeptime;
 	unsigned long			last_jiffies;
 	u64				last_jiffies_update;
+	u64				next_idle_tick;
 	u64				next_timer;
 	ktime_t				idle_expires;
 	int				do_timer_last;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC/RFT][PATCH v2 4/6] cpuidle: Return nohz hint from cpuidle_select()
  2018-03-06  9:05 ` [RFC/RFT][PATCH v2 4/6] cpuidle: Return nohz hint from cpuidle_select() Rafael J. Wysocki
@ 2018-03-06  9:28   ` Rafael J. Wysocki
  2018-03-06 10:06   ` [Update][RFC/RFT][PATCH " Rafael J. Wysocki
  1 sibling, 0 replies; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-06  9:28 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Linux PM, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML

Bummer. :-(

On Tue, Mar 6, 2018 at 10:05 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:

> +       if (drv->states[idx].flags & CPUIDLE_FLAG_POLLING) {
> +               *nohz_ret = false;
> +       } else if (drv->states[idx].target_residency < TICK_USEC) {
> +               /*
> +                * Do not stop the tick if there is at least one more state
> +                * within the tick period range that could be used if longer
> +                * idle duration was predicted.
> +                */
> +               *nohz_ret = first_idx > idx &&
> +                           drv->states[first_idx].target_residency < TICK_USEC;

This is reversed, sent a wrong version of the patch.

I'll resend with this fixed shortly.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Update][RFC/RFT][PATCH v2 4/6] cpuidle: Return nohz hint from cpuidle_select()
  2018-03-06  9:05 ` [RFC/RFT][PATCH v2 4/6] cpuidle: Return nohz hint from cpuidle_select() Rafael J. Wysocki
  2018-03-06  9:28   ` Rafael J. Wysocki
@ 2018-03-06 10:06   ` Rafael J. Wysocki
  1 sibling, 0 replies; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-06 10:06 UTC (permalink / raw)
  To: Peter Zijlstra, Linux PM
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li,
	Mike Galbraith, LKML

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback to allow a boolean value indicating
whether or not the tick should be stopped before entering the
selected state to be returned from there.

Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return 'false" through
it if:
 (1) the idle exit latency is constrained at 0,
 (2) the selected state is a polling one, or
 (3) the target residency of the selected state is less than the
     length of the tick period and there is at least one deeper
     idle state available within the tick period range.

Since the value returned through the new argument pointer is not
used yet, this change is not expected to alter the functionality of
the code.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

Update: Fix a reversed value of *nohz_ret in menu_select().

---
 drivers/cpuidle/cpuidle.c          |   10 ++++++++--
 drivers/cpuidle/governors/ladder.c |    3 ++-
 drivers/cpuidle/governors/menu.c   |   26 ++++++++++++++++++++++----
 include/linux/cpuidle.h            |    6 ++++--
 kernel/sched/idle.c                |    4 +++-
 5 files changed, 39 insertions(+), 10 deletions(-)

Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -131,7 +131,8 @@ extern bool cpuidle_not_available(struct
 				  struct cpuidle_device *dev);
 
 extern int cpuidle_select(struct cpuidle_driver *drv,
-			  struct cpuidle_device *dev);
+			  struct cpuidle_device *dev,
+			  bool *nohz_ret);
 extern int cpuidle_enter(struct cpuidle_driver *drv,
 			 struct cpuidle_device *dev, int index);
 extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -246,7 +247,8 @@ struct cpuidle_governor {
 					struct cpuidle_device *dev);
 
 	int  (*select)		(struct cpuidle_driver *drv,
-					struct cpuidle_device *dev);
+					struct cpuidle_device *dev,
+					bool *nohz_ret);
 	void (*reflect)		(struct cpuidle_device *dev, int index);
 };
 
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -186,13 +186,15 @@ static void cpuidle_idle_call(void)
 		next_state = cpuidle_find_deepest_state(drv, dev);
 		call_cpuidle(drv, dev, next_state);
 	} else {
+		bool nohz = true;
+
 		tick_nohz_idle_go_idle(true);
 		rcu_idle_enter();
 
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
 		 */
-		next_state = cpuidle_select(drv, dev);
+		next_state = cpuidle_select(drv, dev, &nohz);
 		entered_state = call_cpuidle(drv, dev, next_state);
 		/*
 		 * Give the governor an opportunity to reflect on the outcome
Index: linux-pm/drivers/cpuidle/cpuidle.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/cpuidle.c
+++ linux-pm/drivers/cpuidle/cpuidle.c
@@ -263,12 +263,18 @@ int cpuidle_enter_state(struct cpuidle_d
  *
  * @drv: the cpuidle driver
  * @dev: the cpuidle device
+ * @nohz_ret: indication on whether or not to stop the tick
  *
  * Returns the index of the idle state.  The return value must not be negative.
+ *
+ * The memory location pointed to by @nohz_ret is expected to be written the
+ * 'false' boolean value if the scheduler tick should not be stopped before
+ * entering the returned state.
  */
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+		   bool *nohz_ret)
 {
-	return cpuidle_curr_governor->select(drv, dev);
+	return cpuidle_curr_governor->select(drv, dev, nohz_ret);
 }
 
 /**
Index: linux-pm/drivers/cpuidle/governors/ladder.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/ladder.c
+++ linux-pm/drivers/cpuidle/governors/ladder.c
@@ -63,9 +63,10 @@ static inline void ladder_do_selection(s
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
+ * @dummy: not used
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-				struct cpuidle_device *dev)
+			       struct cpuidle_device *dev, bool *dummy)
 {
 	struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
 	struct device *device = get_cpu_device(dev->cpu);
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -279,8 +279,10 @@ again:
  * menu_select - selects the next idle state to enter
  * @drv: cpuidle driver containing state data
  * @dev: the CPU
+ * @nohz_ret: indication on whether or not to stop the tick
  */
-static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+		       bool *nohz_ret)
 {
 	struct menu_device *data = this_cpu_ptr(&menu_devices);
 	struct device *device = get_cpu_device(dev->cpu);
@@ -303,8 +305,10 @@ static int menu_select(struct cpuidle_dr
 		latency_req = resume_latency;
 
 	/* Special case when user has set very strict latency requirement */
-	if (unlikely(latency_req == 0))
+	if (unlikely(latency_req == 0)) {
+		*nohz_ret = false;
 		return 0;
+	}
 
 	/* determine the expected residency time, round up */
 	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
@@ -367,10 +371,12 @@ static int menu_select(struct cpuidle_dr
 			continue;
 		if (idx == -1)
 			idx = i; /* first enabled state */
-		if (s->target_residency > data->predicted_us)
-			break;
 		if (s->exit_latency > latency_req)
 			break;
+		if (s->target_residency > data->predicted_us) {
+			first_idx = i;
+			break;
+		}
 
 		idx = i;
 	}
@@ -378,6 +384,18 @@ static int menu_select(struct cpuidle_dr
 	if (idx == -1)
 		idx = 0; /* No states enabled. Must use 0. */
 
+	if (drv->states[idx].flags & CPUIDLE_FLAG_POLLING) {
+		*nohz_ret = false;
+	} else if (drv->states[idx].target_residency < TICK_USEC) {
+		/*
+		 * Do not stop the tick if there is at least one more state
+		 * within the tick period range that could be used if longer
+		 * idle duration was predicted.
+		 */
+		*nohz_ret = !(first_idx > idx &&
+			      drv->states[first_idx].target_residency < TICK_USEC);
+	}
+
 	data->last_state_idx = idx;
 
 	return data->last_state_idx;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code
  2018-03-06  9:02 ` [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
@ 2018-03-07 23:18   ` Frederic Weisbecker
  2018-03-08  9:22     ` Rafael J. Wysocki
  0 siblings, 1 reply; 20+ messages in thread
From: Frederic Weisbecker @ 2018-03-07 23:18 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Linux PM, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML

On Tue, Mar 06, 2018 at 10:02:01AM +0100, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/time/tick-sched.c
> ===================================================================
> --- linux-pm.orig/kernel/time/tick-sched.c
> +++ linux-pm/kernel/time/tick-sched.c
> +
> +/**
> + * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
> + *
> + * Called when we start the idle loop.
> + */
> +void tick_nohz_idle_prepare(void)
> +{
> +	struct tick_sched *ts;
> +
> +	__tick_nohz_idle_prepare();
> +
> +	local_irq_disable();
> +
> +	ts = this_cpu_ptr(&tick_cpu_sched);
> +	ts->inidle = 1;
> +
> +	local_irq_enable();
> +}

Why not calling tick_nohz_start_idle() from there? This is going to
simplify the rest, you won't need to call tick_nohz_idle_go_idle()
from places that don't want to stop the tick and you can then remove
the stop_tick argument.

> +
> +/**
> + * tick_nohz_idle_go_idle - start idle period on the current CPU.
> + * @stop_tick: Whether or not to stop the idle tick.
> + *
> + * When @stop_tick is set and the next event is more than a tick into the
> + * future, stop the idle tick.
> + */
> +void tick_nohz_idle_go_idle(bool stop_tick)
> +{
> +	__tick_nohz_idle_enter(this_cpu_ptr(&tick_cpu_sched), stop_tick);
> +}
> +
> +void tick_nohz_idle_enter(void)
> +{
> +	struct tick_sched *ts;
> +
> +	__tick_nohz_idle_prepare();
>  
>  	local_irq_disable();
>  
>  	ts = this_cpu_ptr(&tick_cpu_sched);
>  	ts->inidle = 1;
> -	__tick_nohz_idle_enter(ts);
> +	__tick_nohz_idle_enter(ts, true);
>  
>  	local_irq_enable();
>  }

Ah I see you're keeping tick_nohz_idle_enter() around because of the callsite
in xen. It looks like a hack (from the xen part), I'll need to have a look at
it, just a note for myself...

Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC/RFT][PATCH v2 2/6] sched: idle: Do not stop the tick upfront in the idle loop
  2018-03-06  9:02 ` [RFC/RFT][PATCH v2 2/6] sched: idle: Do not stop the tick upfront in the idle loop Rafael J. Wysocki
@ 2018-03-07 23:39   ` Frederic Weisbecker
  2018-03-08  9:05     ` Rafael J. Wysocki
  0 siblings, 1 reply; 20+ messages in thread
From: Frederic Weisbecker @ 2018-03-07 23:39 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Linux PM, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML

On Tue, Mar 06, 2018 at 10:02:15AM +0100, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/idle.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/idle.c
> +++ linux-pm/kernel/sched/idle.c
> @@ -220,13 +220,17 @@ static void do_idle(void)
>  	 */
>  
>  	__current_set_polling();
> -	tick_nohz_idle_enter();
> +	tick_nohz_idle_prepare();

Since we leave tick_nohz_idle_exit() unchanged, can we keep tick_nohz_idle_prepare()
under the name tick_nohz_idle_enter() so that we stay symetric? And then make xen call
the two functions:

    tick_nohz_idle_enter();
    tick_nohz_idle_go_idle();

Also can we rename tick_nohz_idle_go_idle() to tick_nohz_idle_stop_tick() ?
This will be more self-explanatory.

Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC/RFT][PATCH v2 2/6] sched: idle: Do not stop the tick upfront in the idle loop
  2018-03-07 23:39   ` Frederic Weisbecker
@ 2018-03-08  9:05     ` Rafael J. Wysocki
  0 siblings, 0 replies; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-08  9:05 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Linux PM, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML

On Thursday, March 8, 2018 12:39:12 AM CET Frederic Weisbecker wrote:
> On Tue, Mar 06, 2018 at 10:02:15AM +0100, Rafael J. Wysocki wrote:
> > Index: linux-pm/kernel/sched/idle.c
> > ===================================================================
> > --- linux-pm.orig/kernel/sched/idle.c
> > +++ linux-pm/kernel/sched/idle.c
> > @@ -220,13 +220,17 @@ static void do_idle(void)
> >  	 */
> >  
> >  	__current_set_polling();
> > -	tick_nohz_idle_enter();
> > +	tick_nohz_idle_prepare();
> 
> Since we leave tick_nohz_idle_exit() unchanged, can we keep tick_nohz_idle_prepare()
> under the name tick_nohz_idle_enter() so that we stay symetric? And then make xen call
> the two functions:
> 
>     tick_nohz_idle_enter();
>     tick_nohz_idle_go_idle();

No problem with that.

> Also can we rename tick_nohz_idle_go_idle() to tick_nohz_idle_stop_tick() ?
> This will be more self-explanatory.

But it doesn't always stop the tick which is why I chose the other name.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code
  2018-03-07 23:18   ` Frederic Weisbecker
@ 2018-03-08  9:22     ` Rafael J. Wysocki
  2018-03-08 15:14       ` Frederic Weisbecker
  0 siblings, 1 reply; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-08  9:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Linux PM, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML

On Thursday, March 8, 2018 12:18:29 AM CET Frederic Weisbecker wrote:
> On Tue, Mar 06, 2018 at 10:02:01AM +0100, Rafael J. Wysocki wrote:
> > Index: linux-pm/kernel/time/tick-sched.c
> > ===================================================================
> > --- linux-pm.orig/kernel/time/tick-sched.c
> > +++ linux-pm/kernel/time/tick-sched.c
> > +
> > +/**
> > + * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
> > + *
> > + * Called when we start the idle loop.
> > + */
> > +void tick_nohz_idle_prepare(void)
> > +{
> > +	struct tick_sched *ts;
> > +
> > +	__tick_nohz_idle_prepare();
> > +
> > +	local_irq_disable();
> > +
> > +	ts = this_cpu_ptr(&tick_cpu_sched);
> > +	ts->inidle = 1;
> > +
> > +	local_irq_enable();
> > +}
> 
> Why not calling tick_nohz_start_idle() from there? This is going to
> simplify the rest, you won't need to call tick_nohz_idle_go_idle()
> from places that don't want to stop the tick and you can then remove
> the stop_tick argument.

So I guess I would then use ts->idle_entrytime as "now" in the
tick_nohz_get_sleep_length() computation, right?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework
  2018-03-06  8:57 [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework Rafael J. Wysocki
                   ` (5 preceding siblings ...)
  2018-03-06  9:10 ` [RFC/RFT][PATCH v2 6/6] time: tick-sched: Avoid running the same code twice in a row Rafael J. Wysocki
@ 2018-03-08 10:31 ` Mike Galbraith
  2018-03-08 11:10   ` Rafael J. Wysocki
  6 siblings, 1 reply; 20+ messages in thread
From: Mike Galbraith @ 2018-03-08 10:31 UTC (permalink / raw)
  To: Rafael J. Wysocki, Peter Zijlstra, Linux PM
  Cc: Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li, LKML

On Tue, 2018-03-06 at 09:57 +0100, Rafael J. Wysocki wrote:
> Hi All,

Greetings,

> Thanks a lot for the discussion so far!
> 
> Here's a new version of the series addressing some comments from the
> discussion and (most importantly) replacing patches 4 and 5 with another
> (simpler) patch.

Oddity: these patches seemingly manage to cost a bit of power when
lightly loaded.  (but didn't cut cross core nohz cost much.. darn)

i4790 booted nopti nospectre_v2

30 sec tbench
4.16.0.g1b88acc-master (virgin)
Throughput 559.279 MB/sec  1 clients  1 procs  max_latency=0.046 ms
Throughput 997.119 MB/sec  2 clients  2 procs  max_latency=0.246 ms
Throughput 1693.04 MB/sec  4 clients  4 procs  max_latency=4.309 ms
Throughput 3597.2 MB/sec  8 clients  8 procs  max_latency=6.760 ms
Throughput 3474.55 MB/sec  16 clients  16 procs  max_latency=6.743 ms

4.16.0.g1b88acc-master (+v2)
Throughput 588.929 MB/sec  1 clients  1 procs  max_latency=0.291 ms
Throughput 1080.93 MB/sec  2 clients  2 procs  max_latency=0.639 ms
Throughput 1826.3 MB/sec  4 clients  4 procs  max_latency=0.647 ms
Throughput 3561.01 MB/sec  8 clients  8 procs  max_latency=1.279 ms
Throughput 3382.98 MB/sec  16 clients  16 procs  max_latency=4.817 ms

4.16.0.g1b88acc-master (+local nohz mitigation etc for reference [1])
Throughput 722.559 MB/sec  1 clients  1 procs  max_latency=0.087 ms
Throughput 1208.59 MB/sec  2 clients  2 procs  max_latency=0.289 ms
Throughput 2071.94 MB/sec  4 clients  4 procs  max_latency=0.654 ms
Throughput 3784.91 MB/sec  8 clients  8 procs  max_latency=0.974 ms
Throughput 3644.4 MB/sec  16 clients  16 procs  max_latency=5.620 ms

turbostat -q -- firefox /root/tmp/video/BigBuckBunny-DivXPlusHD.mkv & sleep 300;killall firefox

                        PkgWatt
                              1     2     3
4.16.0.g1b88acc-master     6.95  7.03  6.91 (virgin)
4.16.0.g1b88acc-master     7.20  7.25  7.26 (+v2)
4.16.0.g1b88acc-master     6.90  7.06  6.95 (+local)

Why would v2 charge the light firefox load a small but consistent fee?

	-Mike

1. see low end, that's largely due to nohz throttling

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework
  2018-03-08 10:31 ` [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework Mike Galbraith
@ 2018-03-08 11:10   ` Rafael J. Wysocki
  2018-03-08 13:40     ` Mike Galbraith
  0 siblings, 1 reply; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-08 11:10 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux PM, Thomas Gleixner,
	Frederic Weisbecker, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, LKML

On Thu, Mar 8, 2018 at 11:31 AM, Mike Galbraith <mgalbraith@suse.de> wrote:
> On Tue, 2018-03-06 at 09:57 +0100, Rafael J. Wysocki wrote:
>> Hi All,
>
> Greetings,

Hi,

>> Thanks a lot for the discussion so far!
>>
>> Here's a new version of the series addressing some comments from the
>> discussion and (most importantly) replacing patches 4 and 5 with another
>> (simpler) patch.
>
> Oddity: these patches seemingly manage to cost a bit of power when
> lightly loaded.  (but didn't cut cross core nohz cost much.. darn)
>
> i4790 booted nopti nospectre_v2
>
> 30 sec tbench
> 4.16.0.g1b88acc-master (virgin)
> Throughput 559.279 MB/sec  1 clients  1 procs  max_latency=0.046 ms
> Throughput 997.119 MB/sec  2 clients  2 procs  max_latency=0.246 ms
> Throughput 1693.04 MB/sec  4 clients  4 procs  max_latency=4.309 ms
> Throughput 3597.2 MB/sec  8 clients  8 procs  max_latency=6.760 ms
> Throughput 3474.55 MB/sec  16 clients  16 procs  max_latency=6.743 ms
>
> 4.16.0.g1b88acc-master (+v2)
> Throughput 588.929 MB/sec  1 clients  1 procs  max_latency=0.291 ms
> Throughput 1080.93 MB/sec  2 clients  2 procs  max_latency=0.639 ms
> Throughput 1826.3 MB/sec  4 clients  4 procs  max_latency=0.647 ms
> Throughput 3561.01 MB/sec  8 clients  8 procs  max_latency=1.279 ms
> Throughput 3382.98 MB/sec  16 clients  16 procs  max_latency=4.817 ms

max_latency is much lower here for >2 clients/procs, but at the same
time it is much higher for <=2 clients/procs (which then may be
related to the somewhat higher throughput).  Interesting.

> 4.16.0.g1b88acc-master (+local nohz mitigation etc for reference [1])
> Throughput 722.559 MB/sec  1 clients  1 procs  max_latency=0.087 ms
> Throughput 1208.59 MB/sec  2 clients  2 procs  max_latency=0.289 ms
> Throughput 2071.94 MB/sec  4 clients  4 procs  max_latency=0.654 ms
> Throughput 3784.91 MB/sec  8 clients  8 procs  max_latency=0.974 ms
> Throughput 3644.4 MB/sec  16 clients  16 procs  max_latency=5.620 ms
>
> turbostat -q -- firefox /root/tmp/video/BigBuckBunny-DivXPlusHD.mkv & sleep 300;killall firefox
>
>                         PkgWatt
>                               1     2     3
> 4.16.0.g1b88acc-master     6.95  7.03  6.91 (virgin)
> 4.16.0.g1b88acc-master     7.20  7.25  7.26 (+v2)
> 4.16.0.g1b88acc-master     6.90  7.06  6.95 (+local)
>
> Why would v2 charge the light firefox load a small but consistent fee?

Two effects may come into play here I think.

One is that allowing the tick to run biases the menu governor's
predictions towards the lower end, so we may use shallow states more
as a result then (Peter was talking about that).

The second one may be that intermediate states are used quite a bit
"by nature" in this workload (that should be quite straightforward to
verify) and stopping the tick for them saves some energy on idle
entry/exit.

Thanks!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework
  2018-03-08 11:10   ` Rafael J. Wysocki
@ 2018-03-08 13:40     ` Mike Galbraith
  2018-03-09  9:58       ` Rafael J. Wysocki
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Galbraith @ 2018-03-08 13:40 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux PM, Thomas Gleixner,
	Frederic Weisbecker, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, LKML

On Thu, 2018-03-08 at 12:10 +0100, Rafael J. Wysocki wrote:
> On Thu, Mar 8, 2018 at 11:31 AM, Mike Galbraith <mgalbraith@suse.de> wrote:
>                               1     2     3
> > 4.16.0.g1b88acc-master     6.95  7.03  6.91 (virgin)
> > 4.16.0.g1b88acc-master     7.20  7.25  7.26 (+v2)
> > 4.16.0.g1b88acc-master     6.90  7.06  6.95 (+local)
> >
> > Why would v2 charge the light firefox load a small but consistent fee?
> 
> Two effects may come into play here I think.
> 
> One is that allowing the tick to run biases the menu governor's
> predictions towards the lower end, so we may use shallow states more
> as a result then (Peter was talking about that).

Hm, I'd expect that to show up in +local as well then, as it keeps the
tick running when avg_idle < sched_migration_cost (convenient magic
number), but the firefox load runs at the same wattage as virgin.  I'm
also doing this...

--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -335,7 +335,7 @@ static int menu_select(struct cpuidle_dr
                 * C1's exit latency exceeds the user configured limit.
                 */
                polling_threshold = max_t(unsigned int, 20, s->target_residency);
-               if (data->next_timer_us > polling_threshold &&
+               if (expected_interval > polling_threshold &&
                    latency_req > s->exit_latency && !s->disabled &&
                    !dev->states_usage[1].disable)
                        first_idx = 1;

...to help out high frequency cross core throughput, but the firefox
load apparently doesn't tickle that, as significant polling would
surely show in the wattage.

	-Mike

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code
  2018-03-08  9:22     ` Rafael J. Wysocki
@ 2018-03-08 15:14       ` Frederic Weisbecker
  2018-03-08 16:34         ` Rafael J. Wysocki
  0 siblings, 1 reply; 20+ messages in thread
From: Frederic Weisbecker @ 2018-03-08 15:14 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Linux PM, Thomas Gleixner, Frederic Weisbecker,
	Paul McKenney, Thomas Ilsche, Doug Smythies, Rik van Riel,
	Aubrey Li, Mike Galbraith, LKML

On Thu, Mar 08, 2018 at 10:22:26AM +0100, Rafael J. Wysocki wrote:
> On Thursday, March 8, 2018 12:18:29 AM CET Frederic Weisbecker wrote:
> > On Tue, Mar 06, 2018 at 10:02:01AM +0100, Rafael J. Wysocki wrote:
> > > Index: linux-pm/kernel/time/tick-sched.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/time/tick-sched.c
> > > +++ linux-pm/kernel/time/tick-sched.c
> > > +
> > > +/**
> > > + * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
> > > + *
> > > + * Called when we start the idle loop.
> > > + */
> > > +void tick_nohz_idle_prepare(void)
> > > +{
> > > +	struct tick_sched *ts;
> > > +
> > > +	__tick_nohz_idle_prepare();
> > > +
> > > +	local_irq_disable();
> > > +
> > > +	ts = this_cpu_ptr(&tick_cpu_sched);
> > > +	ts->inidle = 1;
> > > +
> > > +	local_irq_enable();
> > > +}
> > 
> > Why not calling tick_nohz_start_idle() from there? This is going to
> > simplify the rest, you won't need to call tick_nohz_idle_go_idle()
> > from places that don't want to stop the tick and you can then remove
> > the stop_tick argument.
> 
> So I guess I would then use ts->idle_entrytime as "now" in the
> tick_nohz_get_sleep_length() computation, right?

Ah right, I missed the need for ktime_get().

You can't use ts->idle_entrytime in tick_nohz_get_sleep_length() because
full dynticks doesn't rely on it.

But I think you can just do the following, with a comment explaining that
idle_entrytime is expected to be fresh enough at this point:

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 57b3de4..8e61796 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -923,7 +923,7 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
 
 		ts->idle_calls++;
 
-		expires = tick_nohz_stop_sched_tick(ts, now, cpu);
+		expires = tick_nohz_stop_sched_tick(ts, ts->idle_entrytime, cpu);
 		if (expires > 0LL) {
 			ts->idle_sleeps++;
 			ts->idle_expires = expires;

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code
  2018-03-08 15:14       ` Frederic Weisbecker
@ 2018-03-08 16:34         ` Rafael J. Wysocki
  2018-03-08 17:00           ` Frederic Weisbecker
  0 siblings, 1 reply; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-08 16:34 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux PM, Thomas Gleixner,
	Frederic Weisbecker, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

On Thu, Mar 8, 2018 at 4:14 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> On Thu, Mar 08, 2018 at 10:22:26AM +0100, Rafael J. Wysocki wrote:
>> On Thursday, March 8, 2018 12:18:29 AM CET Frederic Weisbecker wrote:
>> > On Tue, Mar 06, 2018 at 10:02:01AM +0100, Rafael J. Wysocki wrote:
>> > > Index: linux-pm/kernel/time/tick-sched.c
>> > > ===================================================================
>> > > --- linux-pm.orig/kernel/time/tick-sched.c
>> > > +++ linux-pm/kernel/time/tick-sched.c
>> > > +
>> > > +/**
>> > > + * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
>> > > + *
>> > > + * Called when we start the idle loop.
>> > > + */
>> > > +void tick_nohz_idle_prepare(void)
>> > > +{
>> > > + struct tick_sched *ts;
>> > > +
>> > > + __tick_nohz_idle_prepare();
>> > > +
>> > > + local_irq_disable();
>> > > +
>> > > + ts = this_cpu_ptr(&tick_cpu_sched);
>> > > + ts->inidle = 1;
>> > > +
>> > > + local_irq_enable();
>> > > +}
>> >
>> > Why not calling tick_nohz_start_idle() from there? This is going to
>> > simplify the rest, you won't need to call tick_nohz_idle_go_idle()
>> > from places that don't want to stop the tick and you can then remove
>> > the stop_tick argument.
>>
>> So I guess I would then use ts->idle_entrytime as "now" in the
>> tick_nohz_get_sleep_length() computation, right?
>
> Ah right, I missed the need for ktime_get().
>
> You can't use ts->idle_entrytime in tick_nohz_get_sleep_length() because
> full dynticks doesn't rely on it.
>
> But I think you can just do the following, with a comment explaining that
> idle_entrytime is expected to be fresh enough at this point:
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 57b3de4..8e61796 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -923,7 +923,7 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
>
>                 ts->idle_calls++;
>
> -               expires = tick_nohz_stop_sched_tick(ts, now, cpu);
> +               expires = tick_nohz_stop_sched_tick(ts, ts->idle_entrytime, cpu);
>                 if (expires > 0LL) {
>                         ts->idle_sleeps++;
>                         ts->idle_expires = expires;
>

That's what I was thinking about, but ktime_get() seems to be working too.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code
  2018-03-08 16:34         ` Rafael J. Wysocki
@ 2018-03-08 17:00           ` Frederic Weisbecker
  0 siblings, 0 replies; 20+ messages in thread
From: Frederic Weisbecker @ 2018-03-08 17:00 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux PM, Thomas Gleixner,
	Frederic Weisbecker, Paul McKenney, Thomas Ilsche, Doug Smythies,
	Rik van Riel, Aubrey Li, Mike Galbraith, LKML

On Thu, Mar 08, 2018 at 05:34:20PM +0100, Rafael J. Wysocki wrote:
> On Thu, Mar 8, 2018 at 4:14 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > On Thu, Mar 08, 2018 at 10:22:26AM +0100, Rafael J. Wysocki wrote:
> >> On Thursday, March 8, 2018 12:18:29 AM CET Frederic Weisbecker wrote:
> >> > On Tue, Mar 06, 2018 at 10:02:01AM +0100, Rafael J. Wysocki wrote:
> >> > > Index: linux-pm/kernel/time/tick-sched.c
> >> > > ===================================================================
> >> > > --- linux-pm.orig/kernel/time/tick-sched.c
> >> > > +++ linux-pm/kernel/time/tick-sched.c
> >> > > +
> >> > > +/**
> >> > > + * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
> >> > > + *
> >> > > + * Called when we start the idle loop.
> >> > > + */
> >> > > +void tick_nohz_idle_prepare(void)
> >> > > +{
> >> > > + struct tick_sched *ts;
> >> > > +
> >> > > + __tick_nohz_idle_prepare();
> >> > > +
> >> > > + local_irq_disable();
> >> > > +
> >> > > + ts = this_cpu_ptr(&tick_cpu_sched);
> >> > > + ts->inidle = 1;
> >> > > +
> >> > > + local_irq_enable();
> >> > > +}
> >> >
> >> > Why not calling tick_nohz_start_idle() from there? This is going to
> >> > simplify the rest, you won't need to call tick_nohz_idle_go_idle()
> >> > from places that don't want to stop the tick and you can then remove
> >> > the stop_tick argument.
> >>
> >> So I guess I would then use ts->idle_entrytime as "now" in the
> >> tick_nohz_get_sleep_length() computation, right?
> >
> > Ah right, I missed the need for ktime_get().
> >
> > You can't use ts->idle_entrytime in tick_nohz_get_sleep_length() because
> > full dynticks doesn't rely on it.
> >
> > But I think you can just do the following, with a comment explaining that
> > idle_entrytime is expected to be fresh enough at this point:
> >
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 57b3de4..8e61796 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -923,7 +923,7 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
> >
> >                 ts->idle_calls++;
> >
> > -               expires = tick_nohz_stop_sched_tick(ts, now, cpu);
> > +               expires = tick_nohz_stop_sched_tick(ts, ts->idle_entrytime, cpu);
> >                 if (expires > 0LL) {
> >                         ts->idle_sleeps++;
> >                         ts->idle_expires = expires;
> >
> 
> That's what I was thinking about, but ktime_get() seems to be working too.

Well, ktime_get() has its share of overhead, so if we can avoid a call and use
a cache...

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework
  2018-03-08 13:40     ` Mike Galbraith
@ 2018-03-09  9:58       ` Rafael J. Wysocki
  0 siblings, 0 replies; 20+ messages in thread
From: Rafael J. Wysocki @ 2018-03-09  9:58 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Peter Zijlstra, Linux PM,
	Thomas Gleixner, Frederic Weisbecker, Paul McKenney,
	Thomas Ilsche, Doug Smythies, Rik van Riel, Aubrey Li, LKML

On Thu, Mar 8, 2018 at 2:40 PM, Mike Galbraith <mgalbraith@suse.de> wrote:
> On Thu, 2018-03-08 at 12:10 +0100, Rafael J. Wysocki wrote:
>> On Thu, Mar 8, 2018 at 11:31 AM, Mike Galbraith <mgalbraith@suse.de> wrote:
>>                               1     2     3
>> > 4.16.0.g1b88acc-master     6.95  7.03  6.91 (virgin)
>> > 4.16.0.g1b88acc-master     7.20  7.25  7.26 (+v2)
>> > 4.16.0.g1b88acc-master     6.90  7.06  6.95 (+local)
>> >
>> > Why would v2 charge the light firefox load a small but consistent fee?
>>
>> Two effects may come into play here I think.
>>
>> One is that allowing the tick to run biases the menu governor's
>> predictions towards the lower end, so we may use shallow states more
>> as a result then (Peter was talking about that).
>
> Hm, I'd expect that to show up in +local as well then, as it keeps the
> tick running when avg_idle < sched_migration_cost (convenient magic
> number), but the firefox load runs at the same wattage as virgin.  I'm
> also doing this...
>
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -335,7 +335,7 @@ static int menu_select(struct cpuidle_dr
>                  * C1's exit latency exceeds the user configured limit.
>                  */
>                 polling_threshold = max_t(unsigned int, 20, s->target_residency);
> -               if (data->next_timer_us > polling_threshold &&
> +               if (expected_interval > polling_threshold &&
>                     latency_req > s->exit_latency && !s->disabled &&
>                     !dev->states_usage[1].disable)
>                         first_idx = 1;
>
> ...to help out high frequency cross core throughput, but the firefox
> load apparently doesn't tickle that, as significant polling would
> surely show in the wattage.

OK, so the second reason sounds more likely to me.

Anyway, please retest with the v3 I've just posted.  The previous
iteration had a rather serious issue that might very well influence
the results (it was using stale values sometimes).

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2018-03-09  9:58 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-06  8:57 [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework Rafael J. Wysocki
2018-03-06  9:02 ` [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
2018-03-07 23:18   ` Frederic Weisbecker
2018-03-08  9:22     ` Rafael J. Wysocki
2018-03-08 15:14       ` Frederic Weisbecker
2018-03-08 16:34         ` Rafael J. Wysocki
2018-03-08 17:00           ` Frederic Weisbecker
2018-03-06  9:02 ` [RFC/RFT][PATCH v2 2/6] sched: idle: Do not stop the tick upfront in the idle loop Rafael J. Wysocki
2018-03-07 23:39   ` Frederic Weisbecker
2018-03-08  9:05     ` Rafael J. Wysocki
2018-03-06  9:03 ` [RFC/RFT][PATCH v2 3/6] sched: idle: Do not stop the tick before cpuidle_idle_call() Rafael J. Wysocki
2018-03-06  9:05 ` [RFC/RFT][PATCH v2 4/6] cpuidle: Return nohz hint from cpuidle_select() Rafael J. Wysocki
2018-03-06  9:28   ` Rafael J. Wysocki
2018-03-06 10:06   ` [Update][RFC/RFT][PATCH " Rafael J. Wysocki
2018-03-06  9:10 ` [RFC/RFT][PATCH v2 5/6] sched: idle: Select idle state before stopping the tick Rafael J. Wysocki
2018-03-06  9:10 ` [RFC/RFT][PATCH v2 6/6] time: tick-sched: Avoid running the same code twice in a row Rafael J. Wysocki
2018-03-08 10:31 ` [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework Mike Galbraith
2018-03-08 11:10   ` Rafael J. Wysocki
2018-03-08 13:40     ` Mike Galbraith
2018-03-09  9:58       ` Rafael J. Wysocki

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.