* [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality
@ 2017-09-30  7:20 Aubrey Li
  2017-09-30  7:20 ` [RFC PATCH v2 1/8] cpuidle: menu: extract " Aubrey Li
                   ` (9 more replies)
  0 siblings, 10 replies; 42+ messages in thread
From: Aubrey Li @ 2017-09-30  7:20 UTC (permalink / raw)
  To: tglx, peterz, rjw, len.brown, ak, tim.c.chen; +Cc: x86, linux-kernel, Aubrey Li

We found that under some latency-intensive workloads short idle periods
occur very frequently, and the idle entry and exit paths then start to
dominate, so it's important to optimize them. To detect the short idle
pattern, we need to figure out how long the coming idle will be and what
the threshold of a short idle interval is.

This proposal introduces a CPU idle prediction functionality to catch
the short idle pattern.

First, we check the IRQ timings subsystem to see whether an interrupt
event is coming soon.
-- https://lwn.net/Articles/691297/

Second, we check the scheduler's idle statistics to see whether we are
likely to go into a short idle.
-- https://patchwork.kernel.org/patch/2839221/

Third, we predict the next idle interval by using the prediction
functionality in the idle governor, if it has one.

For the threshold of the short idle interval, we measure the overhead of
the idle entry path from its timestamps, and multiply it by a tunable
ratio exposed here:
-- /proc/sys/kernel/fast_idle_ratio

In this proposal, we use the output of the idle prediction to skip
turning the tick off when a short idle is predicted. Reprogramming the
hardware timer twice (off and on) is expensive for a very short idle.
Further optimizations could be made based on the same indicator.
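
The decision described above can be sketched as follows. This is an
illustrative helper, not kernel code: all names are made up for the
example, and fast_idle_ratio stands in for the value read from
/proc/sys/kernel/fast_idle_ratio.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative sketch, not kernel code: the coming idle is treated as
 * "fast" (so the tick is kept on) when the predicted idle interval is
 * shorter than the measured idle-entry overhead scaled by the
 * fast_idle_ratio tunable.
 */
static bool is_fast_idle(unsigned int predicted_us,
			 unsigned int overhead_us,
			 unsigned int fast_idle_ratio)
{
	return predicted_us < overhead_us * fast_idle_ratio;
}
```

With a 1us measured entry overhead and the default ratio of 10, any
predicted idle shorter than 10us would keep the tick on.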

I observed that when the system is idle, the idle predictor reports 20/s
long idles and ZERO fast idles on one CPU. And when the workload is
running, the idle predictor reports 72899/s fast idles and ZERO long
idles on the same CPU.

Aubrey Li (8):
  cpuidle: menu: extract prediction functionality
  cpuidle: record the overhead of idle entry
  cpuidle: add a new predict interface
  tick/nohz: keep tick on for a fast idle
  timers: keep sleep length updated as needed
  cpuidle: make fast idle threshold tunable
  cpuidle: introduce irq timing to make idle prediction
  cpuidle: introduce run queue average idle to make idle prediction

 drivers/cpuidle/Kconfig          |   1 +
 drivers/cpuidle/cpuidle.c        | 109 +++++++++++++++++++++++++++++++++++++++
 drivers/cpuidle/governors/menu.c |  69 ++++++++++++++++---------
 include/linux/cpuidle.h          |  21 ++++++++
 kernel/sched/idle.c              |  14 ++++-
 kernel/sysctl.c                  |  12 +++++
 kernel/time/tick-sched.c         |   7 +++
 7 files changed, 209 insertions(+), 24 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH v2 1/8] cpuidle: menu: extract prediction functionality
  2017-09-30  7:20 [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality Aubrey Li
@ 2017-09-30  7:20 ` Aubrey Li
  2017-10-14  0:26   ` Rafael J. Wysocki
  2017-09-30  7:20 ` [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry Aubrey Li
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Aubrey Li @ 2017-09-30  7:20 UTC (permalink / raw)
  To: tglx, peterz, rjw, len.brown, ak, tim.c.chen
  Cc: x86, linux-kernel, Aubrey Li, Aubrey Li

There are several factors in the menu governor to predict the next
idle interval:
- the next timer
- the recent idle interval history
- the corrected idle interval pattern
These factors are common enough to be extracted into a single function.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
 drivers/cpuidle/governors/menu.c | 64 +++++++++++++++++++++++++---------------
 1 file changed, 40 insertions(+), 24 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 61b64c2..6bed197 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -276,6 +276,45 @@ static unsigned int get_typical_interval(struct menu_device *data)
 }
 
 /**
+ * menu_predict - predict the coming idle interval
+ *
+ * Return the predicted coming idle interval, in microseconds
+ */
+static unsigned int menu_predict(void)
+{
+	struct menu_device *data = this_cpu_ptr(&menu_devices);
+	unsigned int expected_interval;
+	int cpu = smp_processor_id();
+
+	if (!data)
+		return UINT_MAX;
+
+	/* determine the expected residency time, round up */
+	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
+
+	data->bucket = which_bucket(data->next_timer_us, nr_iowait_cpu(cpu));
+
+	/*
+	 * Force the result of multiplication to be 64 bits even if both
+	 * operands are 32 bits.
+	 * Make sure to round up for half microseconds.
+	 */
+	data->predicted_us = DIV_ROUND_CLOSEST_ULL((uint64_t)
+		data->next_timer_us * data->correction_factor[data->bucket],
+		RESOLUTION * DECAY);
+
+	expected_interval = get_typical_interval(data);
+	expected_interval = min(expected_interval, data->next_timer_us);
+
+	/*
+	 * Use the lowest expected idle interval to pick the idle state.
+	 */
+	data->predicted_us = min(data->predicted_us, expected_interval);
+
+	return data->predicted_us;
+}
+
+/**
  * menu_select - selects the next idle state to enter
  * @drv: cpuidle driver containing state data
  * @dev: the CPU
@@ -289,7 +328,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	int first_idx;
 	int idx;
 	unsigned int interactivity_req;
-	unsigned int expected_interval;
 	unsigned long nr_iowaiters, cpu_load;
 	int resume_latency = dev_pm_qos_raw_read_value(device);
 
@@ -306,24 +344,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	if (unlikely(latency_req == 0))
 		return 0;
 
-	/* determine the expected residency time, round up */
-	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
-
-	get_iowait_load(&nr_iowaiters, &cpu_load);
-	data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
-
-	/*
-	 * Force the result of multiplication to be 64 bits even if both
-	 * operands are 32 bits.
-	 * Make sure to round up for half microseconds.
-	 */
-	data->predicted_us = DIV_ROUND_CLOSEST_ULL((uint64_t)data->next_timer_us *
-					 data->correction_factor[data->bucket],
-					 RESOLUTION * DECAY);
-
-	expected_interval = get_typical_interval(data);
-	expected_interval = min(expected_interval, data->next_timer_us);
-
 	if (CPUIDLE_DRIVER_STATE_START > 0) {
 		struct cpuidle_state *s = &drv->states[CPUIDLE_DRIVER_STATE_START];
 		unsigned int polling_threshold;
@@ -345,14 +365,10 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	}
 
 	/*
-	 * Use the lowest expected idle interval to pick the idle state.
-	 */
-	data->predicted_us = min(data->predicted_us, expected_interval);
-
-	/*
 	 * Use the performance multiplier and the user-configurable
 	 * latency_req to determine the maximum exit latency.
 	 */
+	get_iowait_load(&nr_iowaiters, &cpu_load);
 	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
 	if (latency_req > interactivity_req)
 		latency_req = interactivity_req;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread
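
The scaling step in menu_predict() above can be sketched outside the
kernel as follows. RESOLUTION = 1024 and DECAY = 8 match the menu
governor's constants, and div_round_closest_ull is a stand-in for the
kernel's DIV_ROUND_CLOSEST_ULL; everything else is illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define RESOLUTION 1024	/* fixed-point scale of the correction factors */
#define DECAY 8		/* correction factors are averaged over DECAY steps */

/* Stand-in for the kernel's DIV_ROUND_CLOSEST_ULL: unsigned divide,
 * rounding to the nearest integer rather than truncating. */
static uint64_t div_round_closest_ull(uint64_t x, uint32_t d)
{
	return (x + d / 2) / d;
}

/* Scale the next-timer estimate by the learned correction factor,
 * forcing the multiplication into 64 bits so it cannot overflow. */
static unsigned int corrected_us(unsigned int next_timer_us,
				 unsigned int correction_factor)
{
	return (unsigned int)div_round_closest_ull(
		(uint64_t)next_timer_us * correction_factor,
		RESOLUTION * DECAY);
}
```

A correction factor equal to RESOLUTION * DECAY means "trust the timer
estimate as-is"; smaller factors shrink the estimate.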

* [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry
  2017-09-30  7:20 [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality Aubrey Li
  2017-09-30  7:20 ` [RFC PATCH v2 1/8] cpuidle: menu: extract " Aubrey Li
@ 2017-09-30  7:20 ` Aubrey Li
  2017-10-14  0:35   ` Rafael J. Wysocki
  2017-09-30  7:20 ` [RFC PATCH v2 3/8] cpuidle: add a new predict interface Aubrey Li
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Aubrey Li @ 2017-09-30  7:20 UTC (permalink / raw)
  To: tglx, peterz, rjw, len.brown, ak, tim.c.chen
  Cc: x86, linux-kernel, Aubrey Li, Aubrey Li

Record the overhead of idle entry, in microseconds.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
 drivers/cpuidle/cpuidle.c | 33 +++++++++++++++++++++++++++++++++
 include/linux/cpuidle.h   | 14 ++++++++++++++
 kernel/sched/idle.c       |  8 +++++++-
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 60bb64f..4066308 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -302,6 +302,39 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index)
 		cpuidle_curr_governor->reflect(dev, index);
 }
 
+/* cpuidle_entry_start - record idle entry start */
+void cpuidle_entry_start(void)
+{
+	struct cpuidle_device *dev = cpuidle_get_device();
+
+	if (dev)
+		dev->idle_stat.entry_start = local_clock();
+}
+
+/*
+ * cpuidle_entry_end - record idle entry end, and maintain
+ * the entry overhead average in micro-second
+ */
+void cpuidle_entry_end(void)
+{
+	struct cpuidle_device *dev = cpuidle_get_device();
+	u64 overhead;
+	s64 diff;
+
+	if (dev) {
+		dev->idle_stat.entry_end = local_clock();
+		overhead = div_u64(dev->idle_stat.entry_end -
+				dev->idle_stat.entry_start, NSEC_PER_USEC);
+		diff = overhead - dev->idle_stat.overhead;
+		dev->idle_stat.overhead += diff >> 3;
+		/*
+		 * make sure the overhead is at least 1us
+		 */
+		if (dev->idle_stat.overhead == 0)
+			dev->idle_stat.overhead = 1;
+	}
+}
+
 /**
  * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
  */
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index fc1e5d7..cad9b71 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -72,6 +72,15 @@ struct cpuidle_device_kobj;
 struct cpuidle_state_kobj;
 struct cpuidle_driver_kobj;
 
+struct cpuidle_stat {
+	u64			entry_start;	/* nanosecond */
+	u64			entry_end;	/* nanosecond */
+	u64			overhead;	/* microsecond */
+	unsigned int		predicted_us;	/* microsecond */
+	bool			predicted;	/* ever predicted? */
+	bool			fast_idle;	/* fast idle? */
+};
+
 struct cpuidle_device {
 	unsigned int		registered:1;
 	unsigned int		enabled:1;
@@ -89,6 +98,7 @@ struct cpuidle_device {
 	cpumask_t		coupled_cpus;
 	struct cpuidle_coupled	*coupled;
 #endif
+	struct cpuidle_stat	idle_stat;
 };
 
 DECLARE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
@@ -131,6 +141,8 @@ extern bool cpuidle_not_available(struct cpuidle_driver *drv,
 
 extern int cpuidle_select(struct cpuidle_driver *drv,
 			  struct cpuidle_device *dev);
+extern void cpuidle_entry_start(void);
+extern void cpuidle_entry_end(void);
 extern int cpuidle_enter(struct cpuidle_driver *drv,
 			 struct cpuidle_device *dev, int index);
 extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -164,6 +176,8 @@ static inline bool cpuidle_not_available(struct cpuidle_driver *drv,
 static inline int cpuidle_select(struct cpuidle_driver *drv,
 				 struct cpuidle_device *dev)
 {return -ENODEV; }
+static inline void cpuidle_entry_start(void) { }
+static inline void cpuidle_entry_end(void) { }
 static inline int cpuidle_enter(struct cpuidle_driver *drv,
 				struct cpuidle_device *dev, int index)
 {return -ENODEV; }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6c23e30..0951dac 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -210,6 +210,12 @@ static void cpuidle_idle_call(void)
 static void do_idle(void)
 {
 	/*
+	 * we record idle entry overhead now, so any deferrable items
+	 * in idle entry path need to be placed between cpuidle_entry_start()
+	 * and cpuidle_entry_end()
+	 */
+	cpuidle_entry_start();
+	/*
 	 * If the arch has a polling bit, we maintain an invariant:
 	 *
 	 * Our polling bit is clear if we're not scheduled (i.e. if rq->curr !=
@@ -217,10 +223,10 @@ static void do_idle(void)
 	 * then setting need_resched is guaranteed to cause the CPU to
 	 * reschedule.
 	 */
-
 	__current_set_polling();
 	quiet_vmstat();
 	tick_nohz_idle_enter();
+	cpuidle_entry_end();
 
 	while (!need_resched()) {
 		check_pgt_cache();
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread
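
The running average maintained in cpuidle_entry_end() above is a
shift-based exponential moving average. A self-contained sketch with
illustrative names but the same arithmetic:

```c
#include <assert.h>

/*
 * Sketch of the averaging in cpuidle_entry_end() above:
 * avg += (sample - avg) >> 3, i.e. the average moves 1/8 of the way
 * toward each new sample, with a floor of 1us so the stored overhead
 * never reads as zero. The shift of a negative diff relies on
 * arithmetic right shift, as the kernel code does.
 */
static long long ewma_overhead_us(long long avg, long long sample)
{
	long long diff = sample - avg;

	avg += diff >> 3;	/* 1/8 weight for the new sample */
	if (avg == 0)
		avg = 1;	/* keep at least 1us */
	return avg;
}
```

The 1/8 weight makes the average robust against a single outlier entry
while still tracking sustained changes in the entry cost.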

* [RFC PATCH v2 3/8] cpuidle: add a new predict interface
  2017-09-30  7:20 [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality Aubrey Li
  2017-09-30  7:20 ` [RFC PATCH v2 1/8] cpuidle: menu: extract " Aubrey Li
  2017-09-30  7:20 ` [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry Aubrey Li
@ 2017-09-30  7:20 ` Aubrey Li
  2017-10-14  0:45   ` Rafael J. Wysocki
  2017-10-14  1:27   ` Rafael J. Wysocki
  2017-09-30  7:20 ` [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle Aubrey Li
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 42+ messages in thread
From: Aubrey Li @ 2017-09-30  7:20 UTC (permalink / raw)
  To: tglx, peterz, rjw, len.brown, ak, tim.c.chen; +Cc: x86, linux-kernel, Aubrey Li

For governors that have prediction functionality, add a new predict
interface in the cpuidle framework to call into it.
---
 drivers/cpuidle/cpuidle.c        | 34 ++++++++++++++++++++++++++++++++++
 drivers/cpuidle/governors/menu.c |  7 +++++++
 include/linux/cpuidle.h          |  3 +++
 kernel/sched/idle.c              |  1 +
 4 files changed, 45 insertions(+)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 4066308..ef6f7dd 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -336,6 +336,40 @@ void cpuidle_entry_end(void)
 }
 
 /**
+ * cpuidle_predict - predict whether the coming idle is a fast idle or not
+ */
+void cpuidle_predict(void)
+{
+	struct cpuidle_device *dev = cpuidle_get_device();
+	unsigned int overhead_threshold;
+
+	if (!dev)
+		return;
+
+	overhead_threshold = dev->idle_stat.overhead;
+
+	if (cpuidle_curr_governor->predict) {
+		dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
+		/*
+		 * notify the idle governor to avoid a duplicate
+		 * prediction computation
+		 */
+		dev->idle_stat.predicted = true;
+		if (dev->idle_stat.predicted_us < overhead_threshold) {
+			/*
+			 * notify tick subsystem to keep ticking
+			 * for the coming idle
+			 */
+			dev->idle_stat.fast_idle = true;
+		} else
+			dev->idle_stat.fast_idle = false;
+	} else {
+		dev->idle_stat.predicted = false;
+		dev->idle_stat.fast_idle = false;
+	}
+}
+
+/**
  * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
  */
 void cpuidle_install_idle_handler(void)
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 6bed197..90b2a10 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -344,6 +344,12 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	if (unlikely(latency_req == 0))
 		return 0;
 
+	/* don't predict again if the idle framework already did it */
+	if (!dev->idle_stat.predicted)
+		menu_predict();
+	else
+		dev->idle_stat.predicted = false;
+
 	if (CPUIDLE_DRIVER_STATE_START > 0) {
 		struct cpuidle_state *s = &drv->states[CPUIDLE_DRIVER_STATE_START];
 		unsigned int polling_threshold;
@@ -518,6 +524,7 @@ static struct cpuidle_governor menu_governor = {
 	.enable =	menu_enable_device,
 	.select =	menu_select,
 	.reflect =	menu_reflect,
+	.predict =	menu_predict,
 };
 
 /**
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index cad9b71..9ca0288 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -143,6 +143,7 @@ extern int cpuidle_select(struct cpuidle_driver *drv,
 			  struct cpuidle_device *dev);
 extern void cpuidle_entry_start(void);
 extern void cpuidle_entry_end(void);
+extern void cpuidle_predict(void);
 extern int cpuidle_enter(struct cpuidle_driver *drv,
 			 struct cpuidle_device *dev, int index);
 extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -178,6 +179,7 @@ static inline int cpuidle_select(struct cpuidle_driver *drv,
 {return -ENODEV; }
 static inline void cpuidle_entry_start(void) { }
 static inline void cpuidle_entry_end(void) { }
+static inline void cpuidle_predict(void) { }
 static inline int cpuidle_enter(struct cpuidle_driver *drv,
 				struct cpuidle_device *dev, int index)
 {return -ENODEV; }
@@ -255,6 +257,7 @@ struct cpuidle_governor {
 	int  (*select)		(struct cpuidle_driver *drv,
 					struct cpuidle_device *dev);
 	void (*reflect)		(struct cpuidle_device *dev, int index);
+	unsigned int (*predict)(void);
 };
 
 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 0951dac..8704f3c 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -225,6 +225,7 @@ static void do_idle(void)
 	 */
 	__current_set_polling();
 	quiet_vmstat();
+	cpuidle_predict();
 	tick_nohz_idle_enter();
 	cpuidle_entry_end();
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread
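
The "predicted" flag handshake in this patch, where cpuidle_predict()
caches a prediction that menu_select() then consumes instead of
recomputing, can be sketched as below. All names are illustrative and
the hook returns a dummy value.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch of the handshake above: the idle framework calls
 * the governor's predict hook once and sets a flag; the governor's
 * select path then consumes the cached result instead of predicting
 * again, and clears the flag for the next idle cycle. */
struct idle_stat {
	unsigned int predicted_us;
	bool predicted;
};

static int predict_calls;	/* counts hook invocations for the demo */

static unsigned int predict_hook(void)
{
	predict_calls++;
	return 100;		/* dummy predicted interval, in us */
}

static void framework_predict(struct idle_stat *st)
{
	st->predicted_us = predict_hook();
	st->predicted = true;	/* tell select not to redo the work */
}

static void governor_select(struct idle_stat *st)
{
	if (!st->predicted)
		st->predicted_us = predict_hook();
	else
		st->predicted = false;	/* consume the cached prediction */
}
```

A select that immediately follows a framework predict reuses the cached
value; a select on its own (e.g. when the framework path was skipped)
still computes a prediction itself.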

* [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
  2017-09-30  7:20 [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality Aubrey Li
                   ` (2 preceding siblings ...)
  2017-09-30  7:20 ` [RFC PATCH v2 3/8] cpuidle: add a new predict interface Aubrey Li
@ 2017-09-30  7:20 ` Aubrey Li
  2017-10-14  0:51   ` Rafael J. Wysocki
  2017-09-30  7:20 ` [RFC PATCH v2 5/8] timers: keep sleep length updated as needed Aubrey Li
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Aubrey Li @ 2017-09-30  7:20 UTC (permalink / raw)
  To: tglx, peterz, rjw, len.brown, ak, tim.c.chen
  Cc: x86, linux-kernel, Aubrey Li, Aubrey Li

If the next idle is expected to be a fast idle, keep the tick on
before going into idle.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
 drivers/cpuidle/cpuidle.c | 14 ++++++++++++++
 include/linux/cpuidle.h   |  2 ++
 kernel/time/tick-sched.c  |  4 ++++
 3 files changed, 20 insertions(+)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index ef6f7dd..6cb7e17 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -370,6 +370,20 @@ void cpuidle_predict(void)
 }
 
 /**
+ * cpuidle_fast_idle - predict whether or not the coming idle is a fast idle
+ * This function can be called in the irq exit path, so keep it as fast as possible
+ */
+bool cpuidle_fast_idle(void)
+{
+	struct cpuidle_device *dev = cpuidle_get_device();
+
+	if (!dev)
+		return false;
+
+	return dev->idle_stat.fast_idle;
+}
+
+/**
  * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
  */
 void cpuidle_install_idle_handler(void)
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 9ca0288..791db15 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -144,6 +144,7 @@ extern int cpuidle_select(struct cpuidle_driver *drv,
 extern void cpuidle_entry_start(void);
 extern void cpuidle_entry_end(void);
 extern void cpuidle_predict(void);
+extern bool cpuidle_fast_idle(void);
 extern int cpuidle_enter(struct cpuidle_driver *drv,
 			 struct cpuidle_device *dev, int index);
 extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -180,6 +181,7 @@ static inline int cpuidle_select(struct cpuidle_driver *drv,
 static inline void cpuidle_entry_start(void) { }
 static inline void cpuidle_entry_end(void) { }
 static inline void cpuidle_predict(void) { }
+static inline bool cpuidle_fast_idle(void) { return false; }
 static inline int cpuidle_enter(struct cpuidle_driver *drv,
 				struct cpuidle_device *dev, int index)
 {return -ENODEV; }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c7a899c..d663fab 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -27,6 +27,7 @@
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
 #include <linux/context_tracking.h>
+#include <linux/cpuidle.h>
 
 #include <asm/irq_regs.h>
 
@@ -916,6 +917,9 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 			return false;
 	}
 
+	if (cpuidle_fast_idle())
+		return false;
+
 	return true;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH v2 5/8] timers: keep sleep length updated as needed
  2017-09-30  7:20 [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality Aubrey Li
                   ` (3 preceding siblings ...)
  2017-09-30  7:20 ` [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle Aubrey Li
@ 2017-09-30  7:20 ` Aubrey Li
  2017-10-14  0:56   ` Rafael J. Wysocki
  2017-09-30  7:20 ` [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable Aubrey Li
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Aubrey Li @ 2017-09-30  7:20 UTC (permalink / raw)
  To: tglx, peterz, rjw, len.brown, ak, tim.c.chen; +Cc: x86, linux-kernel, Aubrey Li

The sleep length indicates how long we'll be idle. Currently, it's
updated only when tick nohz is entered. This patch series adds new users
of the tick, so the sleep length should be kept updated as needed.
---
 kernel/time/tick-sched.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index d663fab..94fb9b8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1008,8 +1008,11 @@ void tick_nohz_irq_exit(void)
  */
 ktime_t tick_nohz_get_sleep_length(void)
 {
+	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
+	ts->sleep_length = ktime_sub(dev->next_event, ktime_get());
+
 	return ts->sleep_length;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable
  2017-09-30  7:20 [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality Aubrey Li
                   ` (4 preceding siblings ...)
  2017-09-30  7:20 ` [RFC PATCH v2 5/8] timers: keep sleep length updated as needed Aubrey Li
@ 2017-09-30  7:20 ` Aubrey Li
  2017-10-14  0:59   ` Rafael J. Wysocki
  2017-09-30  7:20 ` [RFC PATCH v2 7/8] cpuidle: introduce irq timing to make idle prediction Aubrey Li
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Aubrey Li @ 2017-09-30  7:20 UTC (permalink / raw)
  To: tglx, peterz, rjw, len.brown, ak, tim.c.chen
  Cc: x86, linux-kernel, Aubrey Li, Aubrey Li

Add a knob to make the fast idle threshold tunable.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
 drivers/cpuidle/cpuidle.c |  3 ++-
 include/linux/cpuidle.h   |  1 +
 kernel/sysctl.c           | 12 ++++++++++++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 6cb7e17..5d4f0b6 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -35,6 +35,7 @@ LIST_HEAD(cpuidle_detected_devices);
 static int enabled_devices;
 static int off __read_mostly;
 static int initialized __read_mostly;
+int sysctl_fast_idle_ratio = 10;
 
 int cpuidle_disabled(void)
 {
@@ -346,7 +347,7 @@ void cpuidle_predict(void)
 	if (!dev)
 		return;
 
-	overhead_threshold = dev->idle_stat.overhead;
+	overhead_threshold = dev->idle_stat.overhead * sysctl_fast_idle_ratio;
 
 	if (cpuidle_curr_governor->predict) {
 		dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 791db15..45b8264 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -24,6 +24,7 @@ struct module;
 struct cpuidle_device;
 struct cpuidle_driver;
 
+extern int sysctl_fast_idle_ratio;
 
 /****************************
  * CPUIDLE DEVICE INTERFACE *
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..97f7e8af 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -67,6 +67,7 @@
 #include <linux/kexec.h>
 #include <linux/bpf.h>
 #include <linux/mount.h>
+#include <linux/cpuidle.h>
 
 #include <linux/uaccess.h>
 #include <asm/processor.h>
@@ -1229,6 +1230,17 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_CPU_IDLE
+	{
+		.procname       = "fast_idle_ratio",
+		.data           = &sysctl_fast_idle_ratio,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec_minmax,
+		.extra1         = &one,
+		.extra2         = &one_hundred,
+	},
+#endif
 	{ }
 };
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH v2 7/8] cpuidle: introduce irq timing to make idle prediction
  2017-09-30  7:20 [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality Aubrey Li
                   ` (5 preceding siblings ...)
  2017-09-30  7:20 ` [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable Aubrey Li
@ 2017-09-30  7:20 ` Aubrey Li
  2017-10-14  1:01   ` Rafael J. Wysocki
  2017-09-30  7:20 ` [RFC PATCH v2 8/8] cpuidle: introduce run queue average idle " Aubrey Li
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Aubrey Li @ 2017-09-30  7:20 UTC (permalink / raw)
  To: tglx, peterz, rjw, len.brown, ak, tim.c.chen
  Cc: x86, linux-kernel, Aubrey Li, Aubrey Li

Introduce the IRQ timings output as a factor in predicting the duration
of the coming idle.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
 drivers/cpuidle/Kconfig   |  1 +
 drivers/cpuidle/cpuidle.c | 17 ++++++++++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 7e48eb5..8b07e1c 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -5,6 +5,7 @@ config CPU_IDLE
 	default y if ACPI || PPC_PSERIES
 	select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
 	select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE)
+	select IRQ_TIMINGS
 	help
 	  CPU idle is a generic framework for supporting software-controlled
 	  idle processor power management.  It includes modular cross-platform
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 5d4f0b6..be56cea 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -22,6 +22,7 @@
 #include <linux/module.h>
 #include <linux/suspend.h>
 #include <linux/tick.h>
+#include <linux/interrupt.h>
 #include <trace/events/power.h>
 
 #include "cpuidle.h"
@@ -342,13 +343,27 @@ void cpuidle_entry_end(void)
 void cpuidle_predict(void)
 {
 	struct cpuidle_device *dev = cpuidle_get_device();
-	unsigned int overhead_threshold;
+	unsigned int idle_interval, overhead_threshold;
+	u64 now, next_evt;
 
 	if (!dev)
 		return;
 
 	overhead_threshold = dev->idle_stat.overhead * sysctl_fast_idle_ratio;
 
+	/*
+	 * check the irq timings to see if the next event is coming soon
+	 */
+	now = local_clock();
+	local_irq_disable();
+	next_evt = irq_timings_next_event(now);
+	local_irq_enable();
+	idle_interval = div_u64(next_evt - now, NSEC_PER_USEC);
+	if (idle_interval < overhead_threshold) {
+		dev->idle_stat.fast_idle = true;
+		return;
+	}
+
 	if (cpuidle_curr_governor->predict) {
 		dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
 		/*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread
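
The irq-timings check added above (the gap between now and the next
predicted interrupt, converted to microseconds and compared against the
fast-idle threshold) can be sketched as below; the names are made up
for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Illustrative sketch of the irq-timings check above: given the
 * current time and the predicted time of the next interrupt (both in
 * nanoseconds), decide whether the gap is below the fast-idle
 * threshold (in microseconds). */
static int irq_gap_is_fast_idle(uint64_t now_ns, uint64_t next_evt_ns,
				unsigned int threshold_us)
{
	uint64_t gap_us = (next_evt_ns - now_ns) / NSEC_PER_USEC;

	return gap_us < threshold_us;
}
```

If the next interrupt is close enough, the patch marks the idle as fast
immediately and skips asking the governor at all.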

* [RFC PATCH v2 8/8] cpuidle: introduce run queue average idle to make idle prediction
  2017-09-30  7:20 [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality Aubrey Li
                   ` (6 preceding siblings ...)
  2017-09-30  7:20 ` [RFC PATCH v2 7/8] cpuidle: introduce irq timing to make idle prediction Aubrey Li
@ 2017-09-30  7:20 ` Aubrey Li
  2017-10-14  1:02   ` Rafael J. Wysocki
  2017-10-14  1:14 ` [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality Rafael J. Wysocki
  2017-11-30  1:00 ` Li, Aubrey
  9 siblings, 1 reply; 42+ messages in thread
From: Aubrey Li @ 2017-09-30  7:20 UTC (permalink / raw)
  To: tglx, peterz, rjw, len.brown, ak, tim.c.chen
  Cc: x86, linux-kernel, Aubrey Li, Aubrey Li

Introduce the scheduler's run queue average idle as a factor in the
idle prediction.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
 drivers/cpuidle/cpuidle.c | 12 ++++++++++++
 include/linux/cpuidle.h   |  1 +
 kernel/sched/idle.c       |  5 +++++
 3 files changed, 18 insertions(+)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index be56cea..9424a2d 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -364,6 +364,18 @@ void cpuidle_predict(void)
 		return;
 	}
 
+	/*
+	 * check scheduler if the coming idle is likely a fast idle
+	 */
+	idle_interval = div_u64(sched_idle_avg(), NSEC_PER_USEC);
+	if (idle_interval < overhead_threshold) {
+		dev->idle_stat.fast_idle = true;
+		return;
+	}
+
+	/*
+	 * check the idle governor if the coming idle is likely a fast idle
+	 */
 	if (cpuidle_curr_governor->predict) {
 		dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
 		/*
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 45b8264..387d72b 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -234,6 +234,7 @@ static inline void cpuidle_use_deepest_state(bool enable)
 /* kernel/sched/idle.c */
 extern void sched_idle_set_state(struct cpuidle_state *idle_state);
 extern void default_idle_call(void);
+extern u64 sched_idle_avg(void);
 
 #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
 void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev, atomic_t *a);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8704f3c..d23b472 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -30,6 +30,11 @@ void sched_idle_set_state(struct cpuidle_state *idle_state)
 	idle_set_state(this_rq(), idle_state);
 }
 
+u64 sched_idle_avg(void)
+{
+	return this_rq()->avg_idle;
+}
+
 static int __read_mostly cpu_idle_force_poll;
 
 void cpu_idle_poll_ctrl(bool enable)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 1/8] cpuidle: menu: extract prediction functionality
  2017-09-30  7:20 ` [RFC PATCH v2 1/8] cpuidle: menu: extract " Aubrey Li
@ 2017-10-14  0:26   ` Rafael J. Wysocki
  2017-10-16  2:46     ` Li, Aubrey
  0 siblings, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-14  0:26 UTC (permalink / raw)
  To: Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, x86, linux-kernel, Aubrey Li

On Saturday, September 30, 2017 9:20:27 AM CEST Aubrey Li wrote:
> There are several factors in the menu governor to predict the next
> idle interval:
> - the next timer
> - the recent idle interval history
> - the corrected idle interval pattern
> These factors are common enough to be extracted into a single function.
> 
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>

This patch alone would break things AFAICS, because it removes code from
menu_select() without a replacement (and menu_predict() is never called
just yet).

Please always do your best to ensure that things will work after *every*
patch in a series.

> ---
>  drivers/cpuidle/governors/menu.c | 64 +++++++++++++++++++++++++---------------
>  1 file changed, 40 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 61b64c2..6bed197 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -276,6 +276,45 @@ static unsigned int get_typical_interval(struct menu_device *data)
>  }
>  
>  /**
> + * menu_predict - predict the coming idle interval
> + *
> + * Return the predicted coming idle interval in micro-second
> + */
> +static unsigned int menu_predict(void)
> +{
> +	struct menu_device *data = this_cpu_ptr(&menu_devices);
> +	unsigned int expected_interval;
> +	int cpu = smp_processor_id();
> +
> +	if (!data)
> +		return UINT_MAX;
> +
> +	/* determine the expected residency time, round up */
> +	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
> +
> +	data->bucket = which_bucket(data->next_timer_us, nr_iowait_cpu(cpu));
> +
> +	/*
> +	 * Force the result of multiplication to be 64 bits even if both
> +	 * operands are 32 bits.
> +	 * Make sure to round up for half microseconds.
> +	 */
> +	data->predicted_us = DIV_ROUND_CLOSEST_ULL((uint64_t)
> +		data->next_timer_us * data->correction_factor[data->bucket],
> +		RESOLUTION * DECAY);
> +
> +	expected_interval = get_typical_interval(data);
> +	expected_interval = min(expected_interval, data->next_timer_us);
> +
> +	/*
> +	 * Use the lowest expected idle interval to pick the idle state.
> +	 */
> +	data->predicted_us = min(data->predicted_us, expected_interval);
> +
> +	return data->predicted_us;
> +}
> +
> +/**
>   * menu_select - selects the next idle state to enter
>   * @drv: cpuidle driver containing state data
>   * @dev: the CPU
> @@ -289,7 +328,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  	int first_idx;
>  	int idx;
>  	unsigned int interactivity_req;
> -	unsigned int expected_interval;
>  	unsigned long nr_iowaiters, cpu_load;
>  	int resume_latency = dev_pm_qos_raw_read_value(device);
>  
> @@ -306,24 +344,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  	if (unlikely(latency_req == 0))
>  		return 0;
>  
> -	/* determine the expected residency time, round up */
> -	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
> -
> -	get_iowait_load(&nr_iowaiters, &cpu_load);
> -	data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
> -
> -	/*
> -	 * Force the result of multiplication to be 64 bits even if both
> -	 * operands are 32 bits.
> -	 * Make sure to round up for half microseconds.
> -	 */
> -	data->predicted_us = DIV_ROUND_CLOSEST_ULL((uint64_t)data->next_timer_us *
> -					 data->correction_factor[data->bucket],
> -					 RESOLUTION * DECAY);
> -
> -	expected_interval = get_typical_interval(data);
> -	expected_interval = min(expected_interval, data->next_timer_us);
> -
>  	if (CPUIDLE_DRIVER_STATE_START > 0) {
>  		struct cpuidle_state *s = &drv->states[CPUIDLE_DRIVER_STATE_START];
>  		unsigned int polling_threshold;
> @@ -345,14 +365,10 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  	}
>  
>  	/*
> -	 * Use the lowest expected idle interval to pick the idle state.
> -	 */
> -	data->predicted_us = min(data->predicted_us, expected_interval);
> -
> -	/*
>  	 * Use the performance multiplier and the user-configurable
>  	 * latency_req to determine the maximum exit latency.
>  	 */
> +	get_iowait_load(&nr_iowaiters, &cpu_load);
>  	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
>  	if (latency_req > interactivity_req)
>  		latency_req = interactivity_req;
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry
  2017-09-30  7:20 ` [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry Aubrey Li
@ 2017-10-14  0:35   ` Rafael J. Wysocki
  2017-10-16  3:11     ` Li, Aubrey
  0 siblings, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-14  0:35 UTC (permalink / raw)
  To: Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, x86, linux-kernel, Aubrey Li

On Saturday, September 30, 2017 9:20:28 AM CEST Aubrey Li wrote:
> Record the overhead of idle entry in microseconds
> 

What is this needed for?

> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> ---
>  drivers/cpuidle/cpuidle.c | 33 +++++++++++++++++++++++++++++++++
>  include/linux/cpuidle.h   | 14 ++++++++++++++
>  kernel/sched/idle.c       |  8 +++++++-
>  3 files changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 60bb64f..4066308 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -302,6 +302,39 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index)
>  		cpuidle_curr_governor->reflect(dev, index);
>  }
>  
> +/* cpuidle_entry_start - record idle entry start */

A proper kerneldoc, please.

> +void cpuidle_entry_start(void)
> +{
> +	struct cpuidle_device *dev = cpuidle_get_device();
> +
> +	if (dev)
> +		dev->idle_stat.entry_start = local_clock();
> +}
> +
> +/*
> + * cpuidle_entry_end - record idle entry end, and maintain
> + * the entry overhead average in micro-second
> + */

A proper kerneldoc, please.

> +void cpuidle_entry_end(void)
> +{
> +	struct cpuidle_device *dev = cpuidle_get_device();
> +	u64 overhead;
> +	s64 diff;
> +
> +	if (dev) {
> +		dev->idle_stat.entry_end = local_clock();
> +		overhead = div_u64(dev->idle_stat.entry_end -
> +				dev->idle_stat.entry_start, NSEC_PER_USEC);

Is the conversion really necessary?

If so, then why?

And if there is a good reason, what about using right shift to do
an approximate conversion to avoid the extra division here?

> +		diff = overhead - dev->idle_stat.overhead;
> +		dev->idle_stat.overhead += diff >> 3;

Can you please explain what happens in the two lines above?

> +		/*
> +		 * make sure overhead is at least 1us
> +		 */
> +		if (dev->idle_stat.overhead == 0)
> +			dev->idle_stat.overhead = 1;
> +	}
> +}
> +
>  /**
>   * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
>   */
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index fc1e5d7..cad9b71 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -72,6 +72,15 @@ struct cpuidle_device_kobj;
>  struct cpuidle_state_kobj;
>  struct cpuidle_driver_kobj;
>  
> +struct cpuidle_stat {
> +	u64			entry_start;	/* nanosecond */
> +	u64			entry_end;	/* nanosecond */
> +	u64			overhead;	/* microsecond */
> +	unsigned int		predicted_us;	/* microsecond */
> +	bool			predicted;	/* ever predicted? */
> +	bool			fast_idle;	/* fast idle? */

Some of the fields here are never used in the code after this patch.

Also it would be good to add a comment describing the meaning of the
fields.

> +};
> +
>  struct cpuidle_device {
>  	unsigned int		registered:1;
>  	unsigned int		enabled:1;
> @@ -89,6 +98,7 @@ struct cpuidle_device {
>  	cpumask_t		coupled_cpus;
>  	struct cpuidle_coupled	*coupled;
>  #endif
> +	struct cpuidle_stat	idle_stat;
>  };
>  
>  DECLARE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
> @@ -131,6 +141,8 @@ extern bool cpuidle_not_available(struct cpuidle_driver *drv,
>  
>  extern int cpuidle_select(struct cpuidle_driver *drv,
>  			  struct cpuidle_device *dev);
> +extern void cpuidle_entry_start(void);
> +extern void cpuidle_entry_end(void);
>  extern int cpuidle_enter(struct cpuidle_driver *drv,
>  			 struct cpuidle_device *dev, int index);
>  extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
> @@ -164,6 +176,8 @@ static inline bool cpuidle_not_available(struct cpuidle_driver *drv,
>  static inline int cpuidle_select(struct cpuidle_driver *drv,
>  				 struct cpuidle_device *dev)
>  {return -ENODEV; }
> +static inline void cpuidle_entry_start(void) { }
> +static inline void cpuidle_entry_end(void) { }
>  static inline int cpuidle_enter(struct cpuidle_driver *drv,
>  				struct cpuidle_device *dev, int index)
>  {return -ENODEV; }
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 6c23e30..0951dac 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -210,6 +210,12 @@ static void cpuidle_idle_call(void)
>  static void do_idle(void)
>  {
>  	/*
> +	 * we record idle entry overhead now, so any deferrable items
> +	 * in idle entry path need to be placed between cpuidle_entry_start()
> +	 * and cpuidle_entry_end()
> +	 */
> +	cpuidle_entry_start();
> +	/*
>  	 * If the arch has a polling bit, we maintain an invariant:
>  	 *
>  	 * Our polling bit is clear if we're not scheduled (i.e. if rq->curr !=
> @@ -217,10 +223,10 @@ static void do_idle(void)
>  	 * then setting need_resched is guaranteed to cause the CPU to
>  	 * reschedule.
>  	 */
> -
>  	__current_set_polling();
>  	quiet_vmstat();
>  	tick_nohz_idle_enter();
> +	cpuidle_entry_end();
>  
>  	while (!need_resched()) {
>  		check_pgt_cache();
> 


* Re: [RFC PATCH v2 3/8] cpuidle: add a new predict interface
  2017-09-30  7:20 ` [RFC PATCH v2 3/8] cpuidle: add a new predict interface Aubrey Li
@ 2017-10-14  0:45   ` Rafael J. Wysocki
  2017-10-16  8:04     ` Li, Aubrey
  2017-10-14  1:27   ` Rafael J. Wysocki
  1 sibling, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-14  0:45 UTC (permalink / raw)
  To: Aubrey Li; +Cc: tglx, peterz, len.brown, ak, tim.c.chen, x86, linux-kernel

On Saturday, September 30, 2017 9:20:29 AM CEST Aubrey Li wrote:
> For governors that have prediction functionality, add a new predict
> interface in the cpuidle framework to call it.

Care to describe how it is intended to work?

Also this patch uses data structures introduced in the previous one
(and the previous one didn't use them), so maybe merge the two?

> ---
>  drivers/cpuidle/cpuidle.c        | 34 ++++++++++++++++++++++++++++++++++
>  drivers/cpuidle/governors/menu.c |  7 +++++++
>  include/linux/cpuidle.h          |  3 +++
>  kernel/sched/idle.c              |  1 +
>  4 files changed, 45 insertions(+)
> 
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 4066308..ef6f7dd 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -336,6 +336,40 @@ void cpuidle_entry_end(void)
>  }
>  
>  /**
> + * cpuidle_predict - predict whether the coming idle is a fast idle or not
> + */
> +void cpuidle_predict(void)
> +{
> +	struct cpuidle_device *dev = cpuidle_get_device();
> +	unsigned int overhead_threshold;
> +
> +	if (!dev)
> +		return;
> +
> +	overhead_threshold = dev->idle_stat.overhead;
> +
> +	if (cpuidle_curr_governor->predict) {
> +		dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
> +		/*
> +		 * notify idle governor to avoid reduplicative
> +		 * prediction computation
> +		 */

This comment is hard to understand.

What does it mean, really?

> +		dev->idle_stat.predicted = true;
> +		if (dev->idle_stat.predicted_us < overhead_threshold) {
> +			/*
> +			 * notify tick subsystem to keep ticking
> +			 * for the coming idle
> +			 */
> +			dev->idle_stat.fast_idle = true;
> +		} else
> +			dev->idle_stat.fast_idle = false;

What about

		dev->idle_stat.fast_idle = dev->idle_stat.predicted_us < overhead_threshold;

Also why don't you use dev->idle_stat.overhead directly here?

And what is the basic idea here?  Why do we compare the predicted
idle duration with the entry/exit overhead from the previous cycle
(if I understood this correctly, that is)?

> +	} else {
> +		dev->idle_stat.predicted = false;
> +		dev->idle_stat.fast_idle = false;
> +	}
> +}
> +
> +/**
>   * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
>   */
>  void cpuidle_install_idle_handler(void)
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 6bed197..90b2a10 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -344,6 +344,12 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  	if (unlikely(latency_req == 0))
>  		return 0;
>  
> +	/*don't predict again if idle framework already did it */
> +	if (!dev->idle_stat.predicted)
> +		menu_predict();
> +	else
> +		dev->idle_stat.predicted = false;

We provide the callback which is going to be used by the core if present,
so why would the core not use it after all?

> +
>  	if (CPUIDLE_DRIVER_STATE_START > 0) {
>  		struct cpuidle_state *s = &drv->states[CPUIDLE_DRIVER_STATE_START];
>  		unsigned int polling_threshold;
> @@ -518,6 +524,7 @@ static struct cpuidle_governor menu_governor = {
>  	.enable =	menu_enable_device,
>  	.select =	menu_select,
>  	.reflect =	menu_reflect,
> +	.predict =	menu_predict,
>  };
>  
>  /**
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index cad9b71..9ca0288 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -143,6 +143,7 @@ extern int cpuidle_select(struct cpuidle_driver *drv,
>  			  struct cpuidle_device *dev);
>  extern void cpuidle_entry_start(void);
>  extern void cpuidle_entry_end(void);
> +extern void cpuidle_predict(void);
>  extern int cpuidle_enter(struct cpuidle_driver *drv,
>  			 struct cpuidle_device *dev, int index);
>  extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
> @@ -178,6 +179,7 @@ static inline int cpuidle_select(struct cpuidle_driver *drv,
>  {return -ENODEV; }
>  static inline void cpuidle_entry_start(void) { }
>  static inline void cpuidle_entry_end(void) { }
> +static inline void cpuidle_predict(void) { }
>  static inline int cpuidle_enter(struct cpuidle_driver *drv,
>  				struct cpuidle_device *dev, int index)
>  {return -ENODEV; }
> @@ -255,6 +257,7 @@ struct cpuidle_governor {
>  	int  (*select)		(struct cpuidle_driver *drv,
>  					struct cpuidle_device *dev);
>  	void (*reflect)		(struct cpuidle_device *dev, int index);
> +	unsigned int (*predict)(void);
>  };
>  
>  #ifdef CONFIG_CPU_IDLE
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 0951dac..8704f3c 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -225,6 +225,7 @@ static void do_idle(void)
>  	 */
>  	__current_set_polling();
>  	quiet_vmstat();
> +	cpuidle_predict();
>  	tick_nohz_idle_enter();
>  	cpuidle_entry_end();
>  
> 


* Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
  2017-09-30  7:20 ` [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle Aubrey Li
@ 2017-10-14  0:51   ` Rafael J. Wysocki
  2017-10-16  3:26     ` Li, Aubrey
  0 siblings, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-14  0:51 UTC (permalink / raw)
  To: Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, x86, linux-kernel, Aubrey Li

On Saturday, September 30, 2017 9:20:30 AM CEST Aubrey Li wrote:
> If the next idle period is expected to be a fast idle, we should keep the
> tick on before going into idle
> 
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>

This also can be merged with the previous patch (and the [2/8]) IMO.

> ---
>  drivers/cpuidle/cpuidle.c | 14 ++++++++++++++
>  include/linux/cpuidle.h   |  2 ++
>  kernel/time/tick-sched.c  |  4 ++++
>  3 files changed, 20 insertions(+)
> 
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index ef6f7dd..6cb7e17 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -370,6 +370,20 @@ void cpuidle_predict(void)
>  }
>  
>  /**
> + * cpuidle_fast_idle - predict whether or not the coming idle is a fast idle
> + * This function can be called in the IRQ exit path, so keep it as fast as
> + * possible
> +bool cpuidle_fast_idle(void)
> +{
> +	struct cpuidle_device *dev = cpuidle_get_device();
> +
> +	if (!dev)
> +		return false;
> +
> +	return dev->idle_stat.fast_idle;

return dev && dev->idle_stat.fast_idle;

> +}
> +
> +/**
>   * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
>   */
>  void cpuidle_install_idle_handler(void)
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index 9ca0288..791db15 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -144,6 +144,7 @@ extern int cpuidle_select(struct cpuidle_driver *drv,
>  extern void cpuidle_entry_start(void);
>  extern void cpuidle_entry_end(void);
>  extern void cpuidle_predict(void);
> +extern bool cpuidle_fast_idle(void);
>  extern int cpuidle_enter(struct cpuidle_driver *drv,
>  			 struct cpuidle_device *dev, int index);
>  extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
> @@ -180,6 +181,7 @@ static inline int cpuidle_select(struct cpuidle_driver *drv,
>  static inline void cpuidle_entry_start(void) { }
>  static inline void cpuidle_entry_end(void) { }
>  static inline void cpuidle_predict(void) { }
> +static inline bool cpuidle_fast_idle(void) { return false; }
>  static inline int cpuidle_enter(struct cpuidle_driver *drv,
>  				struct cpuidle_device *dev, int index)
>  {return -ENODEV; }
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index c7a899c..d663fab 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -27,6 +27,7 @@
>  #include <linux/irq_work.h>
>  #include <linux/posix-timers.h>
>  #include <linux/context_tracking.h>
> +#include <linux/cpuidle.h>
>  
>  #include <asm/irq_regs.h>
>  
> @@ -916,6 +917,9 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
>  			return false;
>  	}
>  
> +	if (cpuidle_fast_idle())
> +		return false;
> +
>  	return true;

return !cpuidle_fast_idle();

>  }
>  
> 

And IMO there is quite a bit too much marketing in the "fast_idle" name,
as it seems to be all about avoiding stopping the tick if the predicted
idle duration is short enough.


* Re: [RFC PATCH v2 5/8] timers: keep sleep length updated as needed
  2017-09-30  7:20 ` [RFC PATCH v2 5/8] timers: keep sleep length updated as needed Aubrey Li
@ 2017-10-14  0:56   ` Rafael J. Wysocki
  2017-10-16  6:46     ` Li, Aubrey
  0 siblings, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-14  0:56 UTC (permalink / raw)
  To: Aubrey Li; +Cc: tglx, peterz, len.brown, ak, tim.c.chen, x86, linux-kernel

On Saturday, September 30, 2017 9:20:31 AM CEST Aubrey Li wrote:
> The sleep length indicates how long we'll be idle. Currently, it's updated
> only on tick-nohz entry. This patch series makes a new requirement on the
> tick, so we should keep the sleep length updated as needed

So what exactly would be the problem with leaving things as they are?

> ---
>  kernel/time/tick-sched.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index d663fab..94fb9b8 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1008,8 +1008,11 @@ void tick_nohz_irq_exit(void)
>   */
>  ktime_t tick_nohz_get_sleep_length(void)
>  {
> +	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
>  	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
>  
> +	ts->sleep_length = ktime_sub(dev->next_event, ktime_get());
> +
>  	return ts->sleep_length;
>  }
>  
> 

I probably wouldn't do it this way ...


* Re: [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable
  2017-09-30  7:20 ` [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable Aubrey Li
@ 2017-10-14  0:59   ` Rafael J. Wysocki
  2017-10-16  6:00     ` Li, Aubrey
  0 siblings, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-14  0:59 UTC (permalink / raw)
  To: Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, x86, linux-kernel, Aubrey Li

On Saturday, September 30, 2017 9:20:32 AM CEST Aubrey Li wrote:
> Add a knob to make fast idle threshold tunable
> 
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>

First of all, I am not sure about the need to add a tunable for this
at all.

> ---
>  drivers/cpuidle/cpuidle.c |  3 ++-
>  include/linux/cpuidle.h   |  1 +
>  kernel/sysctl.c           | 12 ++++++++++++
>  3 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 6cb7e17..5d4f0b6 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -35,6 +35,7 @@ LIST_HEAD(cpuidle_detected_devices);
>  static int enabled_devices;
>  static int off __read_mostly;
>  static int initialized __read_mostly;
> +int sysctl_fast_idle_ratio = 10;
>  
>  int cpuidle_disabled(void)
>  {
> @@ -346,7 +347,7 @@ void cpuidle_predict(void)
>  	if (!dev)
>  		return;
>  
> -	overhead_threshold = dev->idle_stat.overhead;
> +	overhead_threshold = dev->idle_stat.overhead * sysctl_fast_idle_ratio;
>  
>  	if (cpuidle_curr_governor->predict) {
>  		dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index 791db15..45b8264 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -24,6 +24,7 @@ struct module;
>  struct cpuidle_device;
>  struct cpuidle_driver;
>  
> +extern int sysctl_fast_idle_ratio;
>  
>  /****************************
>   * CPUIDLE DEVICE INTERFACE *
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6648fbb..97f7e8af 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -67,6 +67,7 @@
>  #include <linux/kexec.h>
>  #include <linux/bpf.h>
>  #include <linux/mount.h>
> +#include <linux/cpuidle.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/processor.h>
> @@ -1229,6 +1230,17 @@ static struct ctl_table kern_table[] = {
>  		.extra2		= &one,
>  	},
>  #endif
> +#ifdef CONFIG_CPU_IDLE
> +	{
> +		.procname       = "fast_idle_ratio",
> +		.data           = &sysctl_fast_idle_ratio,
> +		.maxlen         = sizeof(int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec_minmax,
> +		.extra1         = &one,
> +		.extra2         = &one_hundred,
> +	},
> +#endif
>  	{ }
>  };
>  

And if there is a good enough reason to add it, shouldn't the tunable be
there in the cpuidle framework?


* Re: [RFC PATCH v2 7/8] cpuidle: introduce irq timing to make idle prediction
  2017-09-30  7:20 ` [RFC PATCH v2 7/8] cpuidle: introduce irq timing to make idle prediction Aubrey Li
@ 2017-10-14  1:01   ` Rafael J. Wysocki
  2017-10-16  6:03     ` Li, Aubrey
  0 siblings, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-14  1:01 UTC (permalink / raw)
  To: Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, x86, linux-kernel, Aubrey Li

On Saturday, September 30, 2017 9:20:33 AM CEST Aubrey Li wrote:
> Introduce the IRQ timings output as a factor to predict the duration
> of the coming idle period
> 
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> ---
>  drivers/cpuidle/Kconfig   |  1 +
>  drivers/cpuidle/cpuidle.c | 17 ++++++++++++++++-
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
> index 7e48eb5..8b07e1c 100644
> --- a/drivers/cpuidle/Kconfig
> +++ b/drivers/cpuidle/Kconfig
> @@ -5,6 +5,7 @@ config CPU_IDLE
>  	default y if ACPI || PPC_PSERIES
>  	select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
>  	select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE)
> +	select IRQ_TIMINGS
>  	help
>  	  CPU idle is a generic framework for supporting software-controlled
>  	  idle processor power management.  It includes modular cross-platform
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 5d4f0b6..be56cea 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -22,6 +22,7 @@
>  #include <linux/module.h>
>  #include <linux/suspend.h>
>  #include <linux/tick.h>
> +#include <linux/interrupt.h>
>  #include <trace/events/power.h>
>  
>  #include "cpuidle.h"
> @@ -342,13 +343,27 @@ void cpuidle_entry_end(void)
>  void cpuidle_predict(void)
>  {
>  	struct cpuidle_device *dev = cpuidle_get_device();
> -	unsigned int overhead_threshold;
> +	unsigned int idle_interval, overhead_threshold;
> +	u64 now, next_evt;
>  
>  	if (!dev)
>  		return;
>  
>  	overhead_threshold = dev->idle_stat.overhead * sysctl_fast_idle_ratio;
>  
> +	/*
> +	 * check irq timings if the next event is coming soon
> +	 */
> +	now = local_clock();
> +	local_irq_disable();
> +	next_evt = irq_timings_next_event(now);
> +	local_irq_enable();
> +	idle_interval = div_u64(next_evt - now, NSEC_PER_USEC);

Another division ...

> +	if (idle_interval < overhead_threshold) {
> +		dev->idle_stat.fast_idle = true;
> +		return;
> +	}
> +
>  	if (cpuidle_curr_governor->predict) {
>  		dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
>  		/*
> 


* Re: [RFC PATCH v2 8/8] cpuidle: introduce run queue average idle to make idle prediction
  2017-09-30  7:20 ` [RFC PATCH v2 8/8] cpuidle: introduce run queue average idle " Aubrey Li
@ 2017-10-14  1:02   ` Rafael J. Wysocki
  0 siblings, 0 replies; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-14  1:02 UTC (permalink / raw)
  To: Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, x86, linux-kernel, Aubrey Li

On Saturday, September 30, 2017 9:20:34 AM CEST Aubrey Li wrote:
> Introduce the scheduler's run queue average idle time as a factor in
> idle prediction
> 
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> ---
>  drivers/cpuidle/cpuidle.c | 12 ++++++++++++
>  include/linux/cpuidle.h   |  1 +
>  kernel/sched/idle.c       |  5 +++++
>  3 files changed, 18 insertions(+)
> 
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index be56cea..9424a2d 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -364,6 +364,18 @@ void cpuidle_predict(void)
>  		return;
>  	}
>  
> +	/*
> +	 * check scheduler if the coming idle is likely a fast idle
> +	 */
> +	idle_interval = div_u64(sched_idle_avg(), NSEC_PER_USEC);

And one more division ...

> +	if (idle_interval < overhead_threshold) {
> +		dev->idle_stat.fast_idle = true;
> +		return;
> +	}
> +
> +	/*
> +	 * check the idle governor if the coming idle is likely a fast idle
> +	 */
>  	if (cpuidle_curr_governor->predict) {
>  		dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
>  		/*
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index 45b8264..387d72b 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -234,6 +234,7 @@ static inline void cpuidle_use_deepest_state(bool enable)
>  /* kernel/sched/idle.c */
>  extern void sched_idle_set_state(struct cpuidle_state *idle_state);
>  extern void default_idle_call(void);
> +extern u64 sched_idle_avg(void);
>  
>  #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
>  void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev, atomic_t *a);
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 8704f3c..d23b472 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -30,6 +30,11 @@ void sched_idle_set_state(struct cpuidle_state *idle_state)
>  	idle_set_state(this_rq(), idle_state);
>  }
>  
> +u64 sched_idle_avg(void)
> +{
> +	return this_rq()->avg_idle;
> +}
> +
>  static int __read_mostly cpu_idle_force_poll;
>  
>  void cpu_idle_poll_ctrl(bool enable)
> 

You could easily combine this patch with the previous one IMO.


* Re: [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality
  2017-09-30  7:20 [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality Aubrey Li
                   ` (7 preceding siblings ...)
  2017-09-30  7:20 ` [RFC PATCH v2 8/8] cpuidle: introduce run queue average idle " Aubrey Li
@ 2017-10-14  1:14 ` Rafael J. Wysocki
  2017-10-16  7:44   ` Li, Aubrey
  2017-11-30  1:00 ` Li, Aubrey
  9 siblings, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-14  1:14 UTC (permalink / raw)
  To: Aubrey Li; +Cc: tglx, peterz, len.brown, ak, tim.c.chen, x86, linux-kernel

On Saturday, September 30, 2017 9:20:26 AM CEST Aubrey Li wrote:
> We found that under some latency-intensive workloads short idle periods
> occur very commonly, and the idle entry and exit paths then start to
> dominate, so it's important to optimize them. To determine the short idle
> pattern, we need to figure out how long the coming idle period will be and
> the threshold of the short idle interval.
> 
> A cpu idle prediction functionality is introduced in this proposal to catch
> the short idle pattern.
> 
> Firstly, we check the IRQ timings subsystem, if there is an event
> coming soon.
> -- https://lwn.net/Articles/691297/
> 
> Secondly, we check the idle statistics of scheduler, if it's likely we'll
> go into a short idle.
> -- https://patchwork.kernel.org/patch/2839221/
> 
> Thirdly, we predict the next idle interval by using the prediction
> functionality in the idle governor, if it has one.
> 
> For the threshold of the short idle interval, we record the timestamps of
> idle entry, and multiply by a tunable parameter here:
> -- /proc/sys/kernel/fast_idle_ratio
> 
> We use the output of the idle prediction to skip turning the tick off if a
> short idle is determined in this proposal. Reprogramming the hardware timer
> twice (off and on) is expensive for a very short idle. There are some
> potential optimizations that can be done according to the same indicator.
> 
> I observed that when the system is idle, the idle predictor reports 20/s
> long idle and ZERO fast idle on one CPU. And when the workload is running,
> the idle predictor reports 72899/s fast idle and ZERO long idle on the
> same CPU.
> 
> Aubrey Li (8):
>   cpuidle: menu: extract prediction functionality
>   cpuidle: record the overhead of idle entry
>   cpuidle: add a new predict interface
>   tick/nohz: keep tick on for a fast idle
>   timers: keep sleep length updated as needed
>   cpuidle: make fast idle threshold tunable
>   cpuidle: introduce irq timing to make idle prediction
>   cpuidle: introduce run queue average idle to make idle prediction
> 
>  drivers/cpuidle/Kconfig          |   1 +
>  drivers/cpuidle/cpuidle.c        | 109 +++++++++++++++++++++++++++++++++++++++
>  drivers/cpuidle/governors/menu.c |  69 ++++++++++++++++---------
>  include/linux/cpuidle.h          |  21 ++++++++
>  kernel/sched/idle.c              |  14 ++++-
>  kernel/sysctl.c                  |  12 +++++
>  kernel/time/tick-sched.c         |   7 +++
>  7 files changed, 209 insertions(+), 24 deletions(-)
> 

Overall, it looks like you could avoid stopping the tick every time the
predicted idle duration is not longer than the tick interval in the first
place.

Why don't you do that?


* Re: [RFC PATCH v2 3/8] cpuidle: add a new predict interface
  2017-09-30  7:20 ` [RFC PATCH v2 3/8] cpuidle: add a new predict interface Aubrey Li
  2017-10-14  0:45   ` Rafael J. Wysocki
@ 2017-10-14  1:27   ` Rafael J. Wysocki
  2017-10-16  9:52     ` Li, Aubrey
  1 sibling, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-14  1:27 UTC (permalink / raw)
  To: Aubrey Li; +Cc: tglx, peterz, len.brown, ak, tim.c.chen, x86, linux-kernel

On Saturday, September 30, 2017 9:20:29 AM CEST Aubrey Li wrote:
> For governors that have prediction functionality, add a new predict
> interface in the cpuidle framework to call it.
> ---
>  drivers/cpuidle/cpuidle.c        | 34 ++++++++++++++++++++++++++++++++++
>  drivers/cpuidle/governors/menu.c |  7 +++++++
>  include/linux/cpuidle.h          |  3 +++
>  kernel/sched/idle.c              |  1 +
>  4 files changed, 45 insertions(+)
> 
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 4066308..ef6f7dd 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -336,6 +336,40 @@ void cpuidle_entry_end(void)
>  }
>  
>  /**
> + * cpuidle_predict - predict whether the coming idle is a fast idle or not
> + */
> +void cpuidle_predict(void)
> +{
> +	struct cpuidle_device *dev = cpuidle_get_device();
> +	unsigned int overhead_threshold;
> +
> +	if (!dev)
> +		return;
> +
> +	overhead_threshold = dev->idle_stat.overhead;
> +
> +	if (cpuidle_curr_governor->predict) {
> +		dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
> +		/*
> +		 * notify idle governor to avoid reduplicative
> +		 * prediction computation
> +		 */
> +		dev->idle_stat.predicted = true;
> +		if (dev->idle_stat.predicted_us < overhead_threshold) {
> +			/*
> +			 * notify tick subsystem to keep ticking
> +			 * for the coming idle
> +			 */
> +			dev->idle_stat.fast_idle = true;
> +		} else
> +			dev->idle_stat.fast_idle = false;
> +	} else {
> +		dev->idle_stat.predicted = false;
> +		dev->idle_stat.fast_idle = false;
> +	}
> +}
> +
> +/**
>   * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
>   */
>  void cpuidle_install_idle_handler(void)
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 6bed197..90b2a10 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -344,6 +344,12 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  	if (unlikely(latency_req == 0))
>  		return 0;
>  
> +	/*don't predict again if idle framework already did it */
> +	if (!dev->idle_stat.predicted)
> +		menu_predict();
> +	else
> +		dev->idle_stat.predicted = false;
> +
>  	if (CPUIDLE_DRIVER_STATE_START > 0) {
>  		struct cpuidle_state *s = &drv->states[CPUIDLE_DRIVER_STATE_START];
>  		unsigned int polling_threshold;
> @@ -518,6 +524,7 @@ static struct cpuidle_governor menu_governor = {
>  	.enable =	menu_enable_device,
>  	.select =	menu_select,
>  	.reflect =	menu_reflect,
> +	.predict =	menu_predict,
>  };
>  
>  /**
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index cad9b71..9ca0288 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -143,6 +143,7 @@ extern int cpuidle_select(struct cpuidle_driver *drv,
>  			  struct cpuidle_device *dev);
>  extern void cpuidle_entry_start(void);
>  extern void cpuidle_entry_end(void);
> +extern void cpuidle_predict(void);
>  extern int cpuidle_enter(struct cpuidle_driver *drv,
>  			 struct cpuidle_device *dev, int index);
>  extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
> @@ -178,6 +179,7 @@ static inline int cpuidle_select(struct cpuidle_driver *drv,
>  {return -ENODEV; }
>  static inline void cpuidle_entry_start(void) { }
>  static inline void cpuidle_entry_end(void) { }
> +static inline void cpuidle_predict(void) { }
>  static inline int cpuidle_enter(struct cpuidle_driver *drv,
>  				struct cpuidle_device *dev, int index)
>  {return -ENODEV; }
> @@ -255,6 +257,7 @@ struct cpuidle_governor {
>  	int  (*select)		(struct cpuidle_driver *drv,
>  					struct cpuidle_device *dev);
>  	void (*reflect)		(struct cpuidle_device *dev, int index);
> +	unsigned int (*predict)(void);
>  };
>  
>  #ifdef CONFIG_CPU_IDLE
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 0951dac..8704f3c 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -225,6 +225,7 @@ static void do_idle(void)
>  	 */
>  	__current_set_polling();
>  	quiet_vmstat();
> +	cpuidle_predict();

One more question here.

This changes the cpuidle code ordering such that if the ->predict callback
is present, the governor prediction will run before tick_nohz_idle_enter(),
whereas without that callback it runs in cpuidle_idle_call().

Is that actually going to work correctly for the menu governor, in particular?

>  	tick_nohz_idle_enter();
>  	cpuidle_entry_end();
>  
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH v2 1/8] cpuidle: menu: extract prediction functionality
  2017-10-14  0:26   ` Rafael J. Wysocki
@ 2017-10-16  2:46     ` Li, Aubrey
  0 siblings, 0 replies; 42+ messages in thread
From: Li, Aubrey @ 2017-10-16  2:46 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 8:26, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:27 AM CEST Aubrey Li wrote:
>> There are several factors in the menu governor to predict the next
>> idle interval:
>> - the next timer
>> - the recent idle interval history
>> - the corrected idle interval pattern
>> These factors are common enough to be extracted to be one function.
>>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> 
> This patch alone would break things AFAICS, because it removes code from
> menu_select() without a replacement (and menu_predict() is never called
> just yet).
> 
> Please always do your best to ensure that things will work after *every*
> patch in a series.

okay, I'll correct this in the next version.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry
  2017-10-14  0:35   ` Rafael J. Wysocki
@ 2017-10-16  3:11     ` Li, Aubrey
  2017-10-17  0:05       ` Rafael J. Wysocki
  0 siblings, 1 reply; 42+ messages in thread
From: Li, Aubrey @ 2017-10-16  3:11 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 8:35, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:28 AM CEST Aubrey Li wrote:
>> Record the overhead of idle entry in micro-second
>>
> 
> What is this needed for?

We need to figure out how long an idle period has to be to count as a
short idle, and recording the overhead serves that purpose. The short
idle threshold is based on this overhead.

> 
>> +void cpuidle_entry_end(void)
>> +{
>> +	struct cpuidle_device *dev = cpuidle_get_device();
>> +	u64 overhead;
>> +	s64 diff;
>> +
>> +	if (dev) {
>> +		dev->idle_stat.entry_end = local_clock();
>> +		overhead = div_u64(dev->idle_stat.entry_end -
>> +				dev->idle_stat.entry_start, NSEC_PER_USEC);
> 
> Is the conversion really necessary?
> 
> If so, then why?

We can choose between nanoseconds and microseconds. Given the workloads
that produce the short idle pattern, I think microsecond resolution is
good enough for real workloads.

Another reason is that the prediction from the idle governor is in
microseconds, so I convert it for comparison.
> 
> And if there is a good reason, what about using right shift to do
> an approximate conversion to avoid the extra division here?

Sure, >> 10 works for me, as I don't think precision is a big deal here.
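
For illustration, the two conversions compare like this (userspace sketch,
not the kernel code; >> 10 divides by 1024 rather than 1000, so it
under-reports by about 2.4%):

```c
#include <stdint.h>

/* Exact conversion: nanoseconds to microseconds. */
static inline uint64_t ns_to_us_div(uint64_t ns)
{
	return ns / 1000;
}

/*
 * Approximate conversion: >> 10 divides by 1024 instead of 1000,
 * slightly under-reporting, which is acceptable when the value only
 * feeds a heuristic threshold comparison.
 */
static inline uint64_t ns_to_us_shift(uint64_t ns)
{
	return ns >> 10;
}
```

e.g. 2 ms comes out as 2000 us exactly versus 1953 us approximately.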

> 
>> +		diff = overhead - dev->idle_stat.overhead;
>> +		dev->idle_stat.overhead += diff >> 3;
> 
> Can you please explain what happens in the two lines above?

It's the online average algorithm, stolen from update_avg() in kernel/sched/core.c.
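
For reference, the algorithm boils down to this (userspace sketch in the
style of update_avg(), not the kernel code itself):

```c
#include <stdint.h>

/*
 * Online average: avg += (sample - avg) / 8, i.e. an exponential
 * moving average with a weight of 1/8 for the newest sample. The
 * signed diff lets the average move down as well as up.
 */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += diff >> 3;
}
```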
> 
>> +		/*
>> +		 * limit overhead to 1us
>> +		 */
>> +		if (dev->idle_stat.overhead == 0)
>> +			dev->idle_stat.overhead = 1;
>> +	}
>> +}
>> +
>>  /**
>>   * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
>>   */
>> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
>> index fc1e5d7..cad9b71 100644
>> --- a/include/linux/cpuidle.h
>> +++ b/include/linux/cpuidle.h
>> @@ -72,6 +72,15 @@ struct cpuidle_device_kobj;
>>  struct cpuidle_state_kobj;
>>  struct cpuidle_driver_kobj;
>>  
>> +struct cpuidle_stat {
>> +	u64			entry_start;	/* nanosecond */
>> +	u64			entry_end;	/* nanosecond */
>> +	u64			overhead;	/* nanosecond */
>> +	unsigned int		predicted_us;	/* microsecond */
>> +	bool			predicted;	/* ever predicted? */
>> +	bool			fast_idle;	/* fast idle? */
> 
> Some of the fields here are never used in the code after this patch.
> 
> Also it would be good to add a comment describing the meaning of the
> fields.
> 
okay, will add in the next version.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
  2017-10-14  0:51   ` Rafael J. Wysocki
@ 2017-10-16  3:26     ` Li, Aubrey
  2017-10-16  4:45       ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Li, Aubrey @ 2017-10-16  3:26 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 8:51, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:30 AM CEST Aubrey Li wrote:
>> If the next idle is expected to be a fast idle, we should keep tick
>> on before going into idle
>>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> 
> This also can be merged with the previous patch (and the [2/8]) IMO.
> 

okay, will merge in the next version.

>> ---
>>  drivers/cpuidle/cpuidle.c | 14 ++++++++++++++
>>  include/linux/cpuidle.h   |  2 ++
>>  kernel/time/tick-sched.c  |  4 ++++
>>  3 files changed, 20 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index ef6f7dd..6cb7e17 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -370,6 +370,20 @@ void cpuidle_predict(void)
>>  }
>>  
>>  /**
>> + * cpuidle_fast_idle - predict whether or not the coming idle is a fast idle
>> + * This function can be called in irq exit path, make it as soon as possible
>> + */
>> +bool cpuidle_fast_idle(void)
>> +{
>> +	struct cpuidle_device *dev = cpuidle_get_device();
>> +
>> +	if (!dev)
>> +		return false;
>> +
>> +	return dev->idle_stat.fast_idle;
> 
> return dev && dev->idle_stat.fast_idle;

Thanks!

>>  
>> @@ -916,6 +917,9 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
>>  			return false;
>>  	}
>>  
>> +	if (cpuidle_fast_idle())
>> +		return false;
>> +
>>  	return true;
> 
> return !cpuidle_fast_idle();

And thanks!
> 
>>  }
>>  
>>
> 
> And IMO there is quite a bit too much marketing in the "fast_idle" name,
> as it seems all about avoiding to stop the tick if the predicted idle
> duration is short enough.
> 

okay, and what's your suggestion? :)
I'll try to move quiet_vmstat() into the normal idle branch if this patch
series is reasonable. Is fast_idle a good indicator for it?

Thanks,
-Aubrey


* Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
  2017-10-16  3:26     ` Li, Aubrey
@ 2017-10-16  4:45       ` Mike Galbraith
  2017-10-16  5:34         ` Li, Aubrey
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2017-10-16  4:45 UTC (permalink / raw)
  To: Li, Aubrey, Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On Mon, 2017-10-16 at 11:26 +0800, Li, Aubrey wrote:
> 
> I'll try to move quiet_vmstat() into the normal idle branch if this patch series
> are reasonable. Is fast_idle a good indication for it?

see x86_tip 62cb1188ed86 sched/idle: Move quiet_vmstate() into the NOHZ code

	-Mike


* Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
  2017-10-16  4:45       ` Mike Galbraith
@ 2017-10-16  5:34         ` Li, Aubrey
  2017-10-16  6:25           ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Li, Aubrey @ 2017-10-16  5:34 UTC (permalink / raw)
  To: Mike Galbraith, Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/16 12:45, Mike Galbraith wrote:
> On Mon, 2017-10-16 at 11:26 +0800, Li, Aubrey wrote:
>>
>> I'll try to move quiet_vmstat() into the normal idle branch if this patch series
>> are reasonable. Is fast_idle a good indication for it?
> 
> see x86_tip 62cb1188ed86 sched/idle: Move quiet_vmstate() into the NOHZ code

It looks like this commit puts stopping the tick on a critical path, since it
can be invoked in the interrupt exit path?

->smp_apic_timer_interrupt
-->irq_exit
--->tick_nohz_irq_exit
---->__tick_nohz_idle_enter
----->tick_nohz_stop_sched_tick
------>quiet_vmstat

Thanks,
-Aubrey


* Re: [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable
  2017-10-14  0:59   ` Rafael J. Wysocki
@ 2017-10-16  6:00     ` Li, Aubrey
  2017-10-17  0:01       ` Rafael J. Wysocki
  0 siblings, 1 reply; 42+ messages in thread
From: Li, Aubrey @ 2017-10-16  6:00 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 8:59, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:32 AM CEST Aubrey Li wrote:
>> Add a knob to make fast idle threshold tunable
>>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> 
> I first of all am not sure about the need to add a tunable for this at all
> in the first place.

Actually I think a fixed value (10) might be good enough, but I'm not quite
sure whether there is a requirement to tune it for different scenarios. For
example, even if the predicted idle interval is 100x the overhead, I might
still want the fast path for a better benchmark score?
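
To illustrate, the check I have in mind is roughly this (userspace sketch;
the names are illustrative, not the exact identifiers in the patch):

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Sketch of the threshold check: the predicted idle interval (us) is
 * compared against the measured entry/exit overhead (us) scaled by
 * the fast_idle_ratio knob. Below the threshold, the idle is "fast"
 * and the tick is left running.
 */
static bool is_fast_idle(uint64_t predicted_us, uint64_t overhead_us,
			 unsigned int ratio)
{
	return predicted_us < overhead_us * ratio;
}
```

With a fixed ratio of 10 and, say, 10 us of overhead, anything predicted
below 100 us would take the fast path.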

>> @@ -1229,6 +1230,17 @@ static struct ctl_table kern_table[] = {
>>  		.extra2		= &one,
>>  	},
>>  #endif
>> +#ifdef CONFIG_CPU_IDLE
>> +	{
>> +		.procname       = "fast_idle_ratio",
>> +		.data           = &sysctl_fast_idle_ratio,
>> +		.maxlen         = sizeof(int),
>> +		.mode           = 0644,
>> +		.proc_handler   = proc_dointvec_minmax,
>> +		.extra1         = &one,
>> +		.extra2         = &one_hundred,
>> +	},
>> +#endif
>>  	{ }
>>  };
>>  
> 
> And if there is a good enough reason to add it, shouldn't the tunable be
> there in the cpuidle framework?
> 
sure, if it makes sense, I'll move it into cpuidle/sysfs.c

Thanks,
-Aubrey


* Re: [RFC PATCH v2 7/8] cpuidle: introduce irq timing to make idle prediction
  2017-10-14  1:01   ` Rafael J. Wysocki
@ 2017-10-16  6:03     ` Li, Aubrey
  0 siblings, 0 replies; 42+ messages in thread
From: Li, Aubrey @ 2017-10-16  6:03 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, x86, linux-kernel

On 2017/10/14 9:01, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:33 AM CEST Aubrey Li wrote:
>> Introduce irq timings output as a factor to predict the duration
>> of the coming idle
>>
>> @@ -342,13 +343,27 @@ void cpuidle_entry_end(void)
>>  void cpuidle_predict(void)
>>  {
>>  	struct cpuidle_device *dev = cpuidle_get_device();
>> -	unsigned int overhead_threshold;
>> +	unsigned int idle_interval, overhead_threshold;
>> +	u64 now, next_evt;
>>  
>>  	if (!dev)
>>  		return;
>>  
>>  	overhead_threshold = dev->idle_stat.overhead * sysctl_fast_idle_ratio;
>>  
>> +	/*
>> +	 * check irq timings if the next event is coming soon
>> +	 */
>> +	now = local_clock();
>> +	local_irq_disable();
>> +	next_evt = irq_timings_next_event(now);
>> +	local_irq_enable();
>> +	idle_interval = div_u64(next_evt - now, NSEC_PER_USEC);
> 
> Another division ...
> 
okay, I'll replace it with >> 10 if microseconds are agreed on as the unit of
comparison.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
  2017-10-16  5:34         ` Li, Aubrey
@ 2017-10-16  6:25           ` Mike Galbraith
  2017-10-16  6:31             ` Li, Aubrey
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2017-10-16  6:25 UTC (permalink / raw)
  To: Li, Aubrey, Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On Mon, 2017-10-16 at 13:34 +0800, Li, Aubrey wrote:
> On 2017/10/16 12:45, Mike Galbraith wrote:
> > On Mon, 2017-10-16 at 11:26 +0800, Li, Aubrey wrote:
> >>
> >> I'll try to move quiet_vmstat() into the normal idle branch if this patch series
> >> are reasonable. Is fast_idle a good indication for it?
> > 
> > see x86_tip 62cb1188ed86 sched/idle: Move quiet_vmstate() into the NOHZ code
> 
> It looks like this commit makes tick stop critical as it can be invoked in interrupt
> exit path?

do_idle() ain't critical?  It is in my book.  Hopefully, you're about
to make that idle_stat.fast_idle thingy liberal enough that we cease
mucking about with the tick on every microscopic idle (and I can then
trash my years old local patch to avoid needlessly eating that cost).

	-Mike


* Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
  2017-10-16  6:25           ` Mike Galbraith
@ 2017-10-16  6:31             ` Li, Aubrey
  0 siblings, 0 replies; 42+ messages in thread
From: Li, Aubrey @ 2017-10-16  6:31 UTC (permalink / raw)
  To: Mike Galbraith, Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/16 14:25, Mike Galbraith wrote:
> On Mon, 2017-10-16 at 13:34 +0800, Li, Aubrey wrote:
>> On 2017/10/16 12:45, Mike Galbraith wrote:
>>> On Mon, 2017-10-16 at 11:26 +0800, Li, Aubrey wrote:
>>>>
>>>> I'll try to move quiet_vmstat() into the normal idle branch if this patch series
>>>> are reasonable. Is fast_idle a good indication for it?
>>>
>>> see x86_tip 62cb1188ed86 sched/idle: Move quiet_vmstate() into the NOHZ code
>>
>> It looks like this commit makes tick stop critical as it can be invoked in interrupt
>> exit path?
> 
> do_idle() ain't critical?  It is in my book.  Hopefully, you're about
> to make that idle_stat.fast_idle thingy liberal enough that we cease
> mucking about with the tick on every microscopic idle (and I can then
> trash my years old local patch to avoid needlessly eating that cost).
> 

Feedback on the rest of the patch set is welcome too, Mike. :)

Thanks,
-Aubrey


* Re: [RFC PATCH v2 5/8] timers: keep sleep length updated as needed
  2017-10-14  0:56   ` Rafael J. Wysocki
@ 2017-10-16  6:46     ` Li, Aubrey
  2017-10-16 23:58       ` Rafael J. Wysocki
  0 siblings, 1 reply; 42+ messages in thread
From: Li, Aubrey @ 2017-10-16  6:46 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 8:56, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:31 AM CEST Aubrey Li wrote:
>> sleep length indicates how long we'll be idle. Currently, it's updated
>> only when tick nohz enters. These patch series make a new requirement
>> with tick, so we should keep sleep length updated as needed
> 
> So what exactly would be the problem with leaving things as they are?

Previously, ts->sleep_length was only updated when the tick was stopped,
as in:

__tick_nohz_idle_enter()
{
	if (can_stop_idle_tick() /* return true */) {
		tick_nohz_stop_sched_tick()
			|
			|-----> update sleep_length
	}
}

Now ts->sleep_length is needed outside of tick_nohz_idle_enter(), so we want
to update sleep_length every time we read it.

If we leave it unchanged, the prediction could read a sleep_length computed a
long time ago if the system keeps ticking.
> 
>> ---
>>  kernel/time/tick-sched.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>> index d663fab..94fb9b8 100644
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -1008,8 +1008,11 @@ void tick_nohz_irq_exit(void)
>>   */
>>  ktime_t tick_nohz_get_sleep_length(void)
>>  {
>> +	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
>>  	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
>>  
>> +	ts->sleep_length = ktime_sub(dev->next_event, ktime_get());
>> +
>>  	return ts->sleep_length;
>>  }
>>  
>>
> 
> I probably wouldn't do it this way ...
> 
> 

May I know your detailed thoughts?

Thanks,
-Aubrey


* Re: [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality
  2017-10-14  1:14 ` [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality Rafael J. Wysocki
@ 2017-10-16  7:44   ` Li, Aubrey
  2017-10-17  0:07     ` Rafael J. Wysocki
  0 siblings, 1 reply; 42+ messages in thread
From: Li, Aubrey @ 2017-10-16  7:44 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 9:14, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:26 AM CEST Aubrey Li wrote:
>> We found under some latency intensive workloads, short idle periods occurs
>> very common, then idle entry and exit path starts to dominate, so it's
>> important to optimize them. To determine the short idle pattern, we need
>> to figure out how long of the coming idle and the threshold of the short
>> idle interval.
>>
>> A cpu idle prediction functionality is introduced in this proposal to catch
>> the short idle pattern.
>>
>> Firstly, we check the IRQ timings subsystem, if there is an event
>> coming soon.
>> -- https://lwn.net/Articles/691297/
>>
>> Secondly, we check the idle statistics of scheduler, if it's likely we'll
>> go into a short idle.
>> -- https://patchwork.kernel.org/patch/2839221/
>>
>> Thirdly, we predict the next idle interval by using the prediction
>> fucntionality in the idle governor if it has.
>>
>> For the threshold of the short idle interval, we record the timestamps of
>> the idle entry, and multiply by a tunable parameter at here:
>> -- /proc/sys/kernel/fast_idle_ratio
>>
>> We use the output of the idle prediction to skip turning tick off if a
>> short idle is determined in this proposal. Reprogramming hardware timer
>> twice(off and on) is expensive for a very short idle. There are some
>> potential optimizations can be done according to the same indicator.
>>
>> I observed when system is idle, the idle predictor reports 20/s long idle
>> and ZERO fast idle on one CPU. And when the workload is running, the idle
>> predictor reports 72899/s fast idle and ZERO long idle on the same CPU.
>>
>> Aubrey Li (8):
>>   cpuidle: menu: extract prediction functionality
>>   cpuidle: record the overhead of idle entry
>>   cpuidle: add a new predict interface
>>   tick/nohz: keep tick on for a fast idle
>>   timers: keep sleep length updated as needed
>>   cpuidle: make fast idle threshold tunable
>>   cpuidle: introduce irq timing to make idle prediction
>>   cpuidle: introduce run queue average idle to make idle prediction
>>
>>  drivers/cpuidle/Kconfig          |   1 +
>>  drivers/cpuidle/cpuidle.c        | 109 +++++++++++++++++++++++++++++++++++++++
>>  drivers/cpuidle/governors/menu.c |  69 ++++++++++++++++---------
>>  include/linux/cpuidle.h          |  21 ++++++++
>>  kernel/sched/idle.c              |  14 ++++-
>>  kernel/sysctl.c                  |  12 +++++
>>  kernel/time/tick-sched.c         |   7 +++
>>  7 files changed, 209 insertions(+), 24 deletions(-)
>>
> 
> Overall, it looks like you could avoid stopping the tick every time the
> predicted idle duration is not longer than the tick interval in the first
> place.
> > Why don't you do that?

I didn't catch this.

Are you suggesting this:

if (!cpu_stat.fast_idle)
	tick_nohz_idle_enter()

Or is your concern why the threshold can't simply be the tick interval?

For the first, can_stop_idle_tick() is a better place to skip turning the tick
off, IMHO. For the latter, if the threshold is close or equal to the tick
period, it's quite possible that the next event is the tick itself and nothing
else.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 3/8] cpuidle: add a new predict interface
  2017-10-14  0:45   ` Rafael J. Wysocki
@ 2017-10-16  8:04     ` Li, Aubrey
  0 siblings, 0 replies; 42+ messages in thread
From: Li, Aubrey @ 2017-10-16  8:04 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 8:45, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:29 AM CEST Aubrey Li wrote:
>> For the governor has predict functionality, add a new predict
>> interface in cpuidle framework to call and use it.
> 
> Care to describe how it is intended to work?
> 
> Also this patch uses data structures introduced in the previous one
> (and the previous one didn't use them), so maybe merge the two?

okay, will refine in the next version.

> 
>> ---
>>  drivers/cpuidle/cpuidle.c        | 34 ++++++++++++++++++++++++++++++++++
>>  drivers/cpuidle/governors/menu.c |  7 +++++++
>>  include/linux/cpuidle.h          |  3 +++
>>  kernel/sched/idle.c              |  1 +
>>  4 files changed, 45 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index 4066308..ef6f7dd 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -336,6 +336,40 @@ void cpuidle_entry_end(void)
>>  }
>>  
>>  /**
>> + * cpuidle_predict - predict whether the coming idle is a fast idle or not
>> + */
>> +void cpuidle_predict(void)
>> +{
>> +	struct cpuidle_device *dev = cpuidle_get_device();
>> +	unsigned int overhead_threshold;
>> +
>> +	if (!dev)
>> +		return;
>> +
>> +	overhead_threshold = dev->idle_stat.overhead;
>> +
>> +	if (cpuidle_curr_governor->predict) {
>> +		dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
>> +		/*
>> +		 * notify idle governor to avoid reduplicative
>> +		 * prediction computation
>> +		 */
> 
> This comment is hard to understand.
> 
> What does it mean, really?
> 

predict() does a lot of computation. If it has already been called in
cpuidle_predict(), we don't want it to be called again in the menu governor.
So I use a flag here to tell the menu governor whether the prediction has
already been done for the coming idle period.
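
The handshake is roughly this (userspace sketch; the names loosely follow
the patch):

```c
#include <stdbool.h>

/* Minimal stand-in for the per-device idle statistics. */
struct idle_stat {
	bool predicted;		/* prediction already done by the framework? */
};

/*
 * The framework sets ->predicted after calling the governor's
 * predict() callback; menu_select() then skips the recomputation
 * once and clears the flag for the next idle period.
 * Returns true when menu_select() should run the prediction itself.
 */
static bool menu_should_predict(struct idle_stat *stat)
{
	if (stat->predicted) {
		stat->predicted = false;	/* consume the flag */
		return false;
	}
	return true;
}
```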

>> +		dev->idle_stat.predicted = true;
>> +		if (dev->idle_stat.predicted_us < overhead_threshold) {
>> +			/*
>> +			 * notify tick subsystem to keep ticking
>> +			 * for the coming idle
>> +			 */
>> +			dev->idle_stat.fast_idle = true;
>> +		} else
>> +			dev->idle_stat.fast_idle = false;
> 
> What about
> 
> 		dev->idle_stat.fast_idle = dev->idle_stat.predicted_us < overhead_threshold;

Ack

> 
> Also why don't you use dev->idle_stat.overhead directly here?

Yeah, I should merge a few patches, because dev->idle_stat.overhead is
referenced many times: for IRQ timings, for the scheduler idle statistics,
and for the idle governor.

> 
> And what is the basic idea here?  Why do we compare the predicted
> idle duration with the entry/exit overhead from the previous cycle
> (if I understood this correctly, that is)?

The entry/exit overhead here is a running average, and its variance should not
be big.

> 
>> +	} else {
>> +		dev->idle_stat.predicted = false;
>> +		dev->idle_stat.fast_idle = false;
>> +	}
>> +}
>> +
>> +/**
>>   * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
>>   */
>>  void cpuidle_install_idle_handler(void)
>> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
>> index 6bed197..90b2a10 100644
>> --- a/drivers/cpuidle/governors/menu.c
>> +++ b/drivers/cpuidle/governors/menu.c
>> @@ -344,6 +344,12 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>>  	if (unlikely(latency_req == 0))
>>  		return 0;
>>  
>> +	/*don't predict again if idle framework already did it */
>> +	if (!dev->idle_stat.predicted)
>> +		menu_predict();
>> +	else
>> +		dev->idle_stat.predicted = false;
> 
> We provide the callback which is going to be used by the core if present,
> so why would the core not use it after all?

There is a case where, in the loop

while (!need_resched()) {
	cpuidle_idle_call()
}

the CPU receives a timer interrupt, wakes up, finds nothing to schedule, and
goes back through the menu governor and mwait to sleep again.

In this case cpuidle_predict() is not reached, but the idle data for the
prediction still needs to be updated.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 3/8] cpuidle: add a new predict interface
  2017-10-14  1:27   ` Rafael J. Wysocki
@ 2017-10-16  9:52     ` Li, Aubrey
  0 siblings, 0 replies; 42+ messages in thread
From: Li, Aubrey @ 2017-10-16  9:52 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 9:27, Rafael J. Wysocki wrote:
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index 0951dac..8704f3c 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -225,6 +225,7 @@ static void do_idle(void)
>>  	 */
>>  	__current_set_polling();
>>  	quiet_vmstat();
>> +	cpuidle_predict();
> 
> One more question here.
> 
> This changes the cpuidle code ordering such that if the ->predict callback
> is present, the governor prediction will run before tick_nohz_idle_enter(),
> whereas without that callback it runs in cpuidle_idle_call().
> 
> Is that actually going to work correctly for the menu governor, in particular?
>

This depends on how the prediction works; patch 1/8 has the details.

The first factor we use is the next clock event from the tick device, which is
not related to the cpuidle call.

The other two factors are derived from the idle interval history, and that
data is already maintained in the menu_device data structure.

I'm not sure whether this addresses your concern; let me know if there is
anything else I can provide to help.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 5/8] timers: keep sleep length updated as needed
  2017-10-16  6:46     ` Li, Aubrey
@ 2017-10-16 23:58       ` Rafael J. Wysocki
  2017-10-17  6:10         ` Li, Aubrey
  0 siblings, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-16 23:58 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On Monday, October 16, 2017 8:46:41 AM CEST Li, Aubrey wrote:
> On 2017/10/14 8:56, Rafael J. Wysocki wrote:
> > On Saturday, September 30, 2017 9:20:31 AM CEST Aubrey Li wrote:
> >> sleep length indicates how long we'll be idle. Currently, it's updated
> >> only when tick nohz enters. These patch series make a new requirement
> >> with tick, so we should keep sleep length updated as needed
> > 
> > So what exactly would be the problem with leaving things as they are?
> 
> Previously ts->sleep_length is only updated when tick is stopped.
> 
> As follows, in
> 
> __tick_nohz_idle_enter()
> {
> 	if (can_stop_idle_tick() /* return true */) {
> 		tick_nohz_stop_sched_tick()
> 			|
> 			|-----> update sleep_length
> 	}
> }

Which is logical, because the tick will get in the way if we don't stop it,
won't it?

> 
> Now ts->sleep_length is required out of tick_nohz_idle_enter(), so we want
> to update sleep_length every time we read it
> 
> If we leave it unchanged, the prediction could read a sleep_length long time
> ago if the system keep ticking.

Well, but does it make sense to estimate the sleep length without stopping
the tick?

> >> ---
> >>  kernel/time/tick-sched.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> >> index d663fab..94fb9b8 100644
> >> --- a/kernel/time/tick-sched.c
> >> +++ b/kernel/time/tick-sched.c
> >> @@ -1008,8 +1008,11 @@ void tick_nohz_irq_exit(void)
> >>   */
> >>  ktime_t tick_nohz_get_sleep_length(void)
> >>  {
> >> +	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
> >>  	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> >>  
> >> +	ts->sleep_length = ktime_sub(dev->next_event, ktime_get());
> >> +
> >>  	return ts->sleep_length;
> >>  }
> >>  
> >>
> > 
> > I probably wouldn't do it this way ...
> > 
> > 
> 
> May I know the detailed thoughts?

That depends on the answer above. :-)

Thanks,
Rafael


* Re: [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable
  2017-10-16  6:00     ` Li, Aubrey
@ 2017-10-17  0:01       ` Rafael J. Wysocki
  2017-10-17  6:12         ` Li, Aubrey
  0 siblings, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-17  0:01 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On Monday, October 16, 2017 8:00:45 AM CEST Li, Aubrey wrote:
> On 2017/10/14 8:59, Rafael J. Wysocki wrote:
> > On Saturday, September 30, 2017 9:20:32 AM CEST Aubrey Li wrote:
> >> Add a knob to make fast idle threshold tunable
> >>
> >> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> > 
> > I first of all am not sure about the need to add a tunable for this at all
> > in the first place.
> 
> Actually I think a fixed value(10) might be good enough but not quite sure
> if there is a requirement to tune it for different scenario, for example even
> if the predicted idle interval is 100x overhead, I still want a fast path for
> a better benchmark score?

Any new tunables make the test matrix expand considerably, so it generally is
better to err on the conservative side with adding them.

Thanks,
Rafael


* Re: [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry
  2017-10-16  3:11     ` Li, Aubrey
@ 2017-10-17  0:05       ` Rafael J. Wysocki
  2017-10-17  7:04         ` Li, Aubrey
  0 siblings, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-17  0:05 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On Monday, October 16, 2017 5:11:57 AM CEST Li, Aubrey wrote:
> On 2017/10/14 8:35, Rafael J. Wysocki wrote:
> > On Saturday, September 30, 2017 9:20:28 AM CEST Aubrey Li wrote:
> >> Record the overhead of idle entry in micro-second
> >>
> > 
> > What is this needed for?
> 
> We need to figure out how long an idle has to be to count as a short idle,
> and recording the overhead is for this purpose. The short idle threshold
> is based on this overhead.

I don't really understand this statement.

Pretend I'm not familiar with this stuff and try to explain it to me. :-)

> > 
> >> +void cpuidle_entry_end(void)
> >> +{
> >> +	struct cpuidle_device *dev = cpuidle_get_device();
> >> +	u64 overhead;
> >> +	s64 diff;
> >> +
> >> +	if (dev) {
> >> +		dev->idle_stat.entry_end = local_clock();
> >> +		overhead = div_u64(dev->idle_stat.entry_end -
> >> +				dev->idle_stat.entry_start, NSEC_PER_USEC);
> > 
> > Is the conversion really necessary?
> > 
> > If so, then why?
> 
> We can choose between nanoseconds and microseconds. Given that the workload
> results in the short idle pattern, I think microseconds are good enough for
> real workloads.
> 
> Another reason is that the prediction from the idle governor is in
> microseconds, so I convert it for comparison purposes.
> > 
> > And if there is a good reason, what about using right shift to do
> > an approximate conversion to avoid the extra division here?
> 
> Sure, >> 10 works for me, as I don't think precision is a big deal here.
> 
> > 
> >> +		diff = overhead - dev->idle_stat.overhead;
> >> +		dev->idle_stat.overhead += diff >> 3;
> > 
> > Can you please explain what happens in the two lines above?
> 
> Online average computing algorithm, stolen from update_avg() @ kernel/sched/core.c.

OK

Maybe care to add a comment to that effect?

Thanks,
Rafael


* Re: [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality
  2017-10-16  7:44   ` Li, Aubrey
@ 2017-10-17  0:07     ` Rafael J. Wysocki
  2017-10-17  7:32       ` Li, Aubrey
  0 siblings, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-10-17  0:07 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On Monday, October 16, 2017 9:44:41 AM CEST Li, Aubrey wrote:
> On 2017/10/14 9:14, Rafael J. Wysocki wrote:
> > On Saturday, September 30, 2017 9:20:26 AM CEST Aubrey Li wrote:
> >> We found that under some latency-intensive workloads, short idle periods
> >> occur very commonly, and then the idle entry and exit paths start to
> >> dominate, so it's important to optimize them. To determine the short idle
> >> pattern, we need to figure out how long the coming idle will be and the
> >> threshold of the short idle interval.
> >>
> >> A cpu idle prediction functionality is introduced in this proposal to catch
> >> the short idle pattern.
> >>
> >> Firstly, we check the IRQ timings subsystem to see if there is an event
> >> coming soon.
> >> -- https://lwn.net/Articles/691297/
> >>
> >> Secondly, we check the scheduler's idle statistics to see if it's likely
> >> we'll go into a short idle.
> >> -- https://patchwork.kernel.org/patch/2839221/
> >>
> >> Thirdly, we predict the next idle interval by using the prediction
> >> functionality in the idle governor, if it has one.
> >>
> >> For the threshold of the short idle interval, we record the timestamps of
> >> idle entry, and multiply by a tunable parameter here:
> >> -- /proc/sys/kernel/fast_idle_ratio
> >>
> >> In this proposal we use the output of the idle prediction to skip turning
> >> the tick off if a short idle is determined. Reprogramming the hardware
> >> timer twice (off and on) is expensive for a very short idle. There are
> >> some potential optimizations that can be done based on the same indicator.
> >>
> >> I observed that when the system is idle, the idle predictor reports 20
> >> long idles/s and ZERO fast idles on one CPU, and when the workload is
> >> running, it reports 72899 fast idles/s and ZERO long idles on the same CPU.
> >>
> >> Aubrey Li (8):
> >>   cpuidle: menu: extract prediction functionality
> >>   cpuidle: record the overhead of idle entry
> >>   cpuidle: add a new predict interface
> >>   tick/nohz: keep tick on for a fast idle
> >>   timers: keep sleep length updated as needed
> >>   cpuidle: make fast idle threshold tunable
> >>   cpuidle: introduce irq timing to make idle prediction
> >>   cpuidle: introduce run queue average idle to make idle prediction
> >>
> >>  drivers/cpuidle/Kconfig          |   1 +
> >>  drivers/cpuidle/cpuidle.c        | 109 +++++++++++++++++++++++++++++++++++++++
> >>  drivers/cpuidle/governors/menu.c |  69 ++++++++++++++++---------
> >>  include/linux/cpuidle.h          |  21 ++++++++
> >>  kernel/sched/idle.c              |  14 ++++-
> >>  kernel/sysctl.c                  |  12 +++++
> >>  kernel/time/tick-sched.c         |   7 +++
> >>  7 files changed, 209 insertions(+), 24 deletions(-)
> >>
> > 
> > Overall, it looks like you could avoid stopping the tick every time the
> > predicted idle duration is not longer than the tick interval in the first
> > place.
> > > Why don't you do that?
> 
> I didn't catch this.
> 
> Are you suggesting?
> 
> if(!cpu_stat.fast_idle)
> 	tick_nohz_idle_enter()
> 
> Or you concern why the threshold can't simply be tick interval?

That I guess.

> For the first, can_stop_idle_tick() is a better place to skip tick-off IMHO.
> For the latter, if the threshold is close/equal to the tick, it's quite possible
> the next event is the tick and no other else event.

Well, I don't quite get that.

What's the reasoning here?

Thanks,
Rafael


* Re: [RFC PATCH v2 5/8] timers: keep sleep length updated as needed
  2017-10-16 23:58       ` Rafael J. Wysocki
@ 2017-10-17  6:10         ` Li, Aubrey
  0 siblings, 0 replies; 42+ messages in thread
From: Li, Aubrey @ 2017-10-17  6:10 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On 2017/10/17 7:58, Rafael J. Wysocki wrote:
> On Monday, October 16, 2017 8:46:41 AM CEST Li, Aubrey wrote:
>> On 2017/10/14 8:56, Rafael J. Wysocki wrote:
>>> On Saturday, September 30, 2017 9:20:31 AM CEST Aubrey Li wrote:
>>>> Sleep length indicates how long we'll be idle. Currently, it's updated
>>>> only when tick nohz enters. This patch series makes a new requirement
>>>> on the tick, so we should keep the sleep length updated as needed.
>>>
>>> So what exactly would be the problem with leaving things as they are?
>>
>> Previously ts->sleep_length is only updated when tick is stopped.
>>
>> As follows, in
>>
>> __tick_nohz_idle_enter()
>> {
>> 	if (can_stop_idle_tick() /* return true */) {
>> 		tick_nohz_stop_sched_tick()
>> 			|
>> 			|-----> update sleep_length
>> 	}
>> }
> 
> Which is logical, because the tick will get in the way if we don't stop it,
> won't it?
> 
The scenario here is that the tick (4ms) is too long for a short idle
(e.g. 4us), and there can be hundreds of short idles within one tick interval.

>>
>> Now ts->sleep_length is required outside of tick_nohz_idle_enter(), so we
>> want to update sleep_length every time we read it.
>>
>> If we leave it unchanged, the prediction could read a sleep_length from a
>> long time ago if the system keeps ticking.
> 
> Well, but does it make sense to estimate the sleep length without stopping
> the tick?

For example, on the first short idle we had just turned the tick off the
previous time, so we may read sleep length = 3900us. Then we keep the tick on,
and when the 100th short idle comes, the reported sleep length is still 3900us
(not updated), while it actually should be e.g. 100us.
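A toy model of that staleness, with the hypothetical microsecond numbers from the example above (this is a userspace sketch, not the kernel helper itself):

```c
#include <assert.h>
#include <stdint.h>

/*
 * What the patch effectively does in tick_nohz_get_sleep_length():
 * recompute the remaining time to the next programmed event on every
 * read, instead of returning a value cached when the tick was last
 * stopped.  All times are in microseconds for this toy model.
 */
static int64_t sleep_length_us(int64_t next_event_us, int64_t now_us)
{
	return next_event_us - now_us;
}
```

Right after the last refresh the reading is ~3900us; 3800us later a cached value would still say 3900us, while recomputing returns the actual 100us.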

> 
>>>> ---
>>>>  kernel/time/tick-sched.c | 3 +++
>>>>  1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>>>> index d663fab..94fb9b8 100644
>>>> --- a/kernel/time/tick-sched.c
>>>> +++ b/kernel/time/tick-sched.c
>>>> @@ -1008,8 +1008,11 @@ void tick_nohz_irq_exit(void)
>>>>   */
>>>>  ktime_t tick_nohz_get_sleep_length(void)
>>>>  {
>>>> +	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
>>>>  	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
>>>>  
>>>> +	ts->sleep_length = ktime_sub(dev->next_event, ktime_get());
>>>> +
>>>>  	return ts->sleep_length;
>>>>  }
>>>>  
>>>>
>>>
>>> I probably wouldn't do it this way ...
>>>
>>>
>>
>> May I know the detailed thoughts?
> 
> That depends on the answer above. :-)

Does the above explanation address the concern?

Thanks,
-Aubrey


* Re: [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable
  2017-10-17  0:01       ` Rafael J. Wysocki
@ 2017-10-17  6:12         ` Li, Aubrey
  0 siblings, 0 replies; 42+ messages in thread
From: Li, Aubrey @ 2017-10-17  6:12 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On 2017/10/17 8:01, Rafael J. Wysocki wrote:
> On Monday, October 16, 2017 8:00:45 AM CEST Li, Aubrey wrote:
>> On 2017/10/14 8:59, Rafael J. Wysocki wrote:
>>> On Saturday, September 30, 2017 9:20:32 AM CEST Aubrey Li wrote:
>>>> Add a knob to make fast idle threshold tunable
>>>>
>>>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>>>
>>> I first of all am not sure about the need to add a tunable for this at all
>>> in the first place.
>>
>> Actually I think a fixed value(10) might be good enough but not quite sure
>> if there is a requirement to tune it for different scenario, for example even
>> if the predicted idle interval is 100x overhead, I still want a fast path for
>> a better benchmark score?
> 
> Any new tunables make the test matrix expand considerably, so it generally is
> better to err on the conservative side with adding them.
> 

Okay, I'm fine without it. I'll remove it in the next version.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry
  2017-10-17  0:05       ` Rafael J. Wysocki
@ 2017-10-17  7:04         ` Li, Aubrey
  0 siblings, 0 replies; 42+ messages in thread
From: Li, Aubrey @ 2017-10-17  7:04 UTC (permalink / raw)
  To: Rafael J. Wysocki, Mike Galbraith
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On 2017/10/17 8:05, Rafael J. Wysocki wrote:
> On Monday, October 16, 2017 5:11:57 AM CEST Li, Aubrey wrote:
>> On 2017/10/14 8:35, Rafael J. Wysocki wrote:
>>> On Saturday, September 30, 2017 9:20:28 AM CEST Aubrey Li wrote:
>>>> Record the overhead of idle entry in micro-second
>>>>
>>>
>>> What is this needed for?
>>
>> We need to figure out how long an idle has to be to count as a short idle,
>> and recording the overhead is for this purpose. The short idle threshold
>> is based on this overhead.
> 
> I don't really understand this statement.
> 
> Pretend I'm not familiar with this stuff and try to explain it to me. :-)
> 

Okay, let me try, :-)

Today what we do in the idle loop is as follows:

do_idle {
	idle_entry {
	- deferrable stuff like quiet_vmstat
	- turn off tick(without looking at historical/predicted idle interval)
	- rcu idle enter, c-state selection, etc
	}

	idle_call {
	- poll or halt or mwait
	}

	idle_exit {
	- rcu idle exit
	- restore the tick if tick is stopped before enter idle
	}
}

And we have already measured that idle_entry plus idle_exit costs several
microseconds, say 10us.

Now if idle_call is 1000us, much larger than idle_entry and idle_exit, we can
ignore the time cost in idle_entry and idle_exit.

But for some workloads with a short idle pattern, like netperf, idle_call is
only 2us, so idle_entry and idle_exit start to dominate. If we can reduce the
time spent in idle_entry and idle_exit, workload performance improves
significantly.

Modern high-speed networking and low-latency I/O like NVMe disks have this
requirement. Mike's patch was made several years ago, though I don't know the
details. Here is a related article:
https://cacm.acm.org/magazines/2017/4/215032-attack-of-the-killer-microseconds/fulltext
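The arithmetic above can be sketched with a toy calculation (the helper name and the numbers are illustrative, taken from the figures quoted in this mail, not from any kernel API):

```c
#include <assert.h>

/*
 * Toy model of the cost breakdown above: what percentage of an idle
 * episode is spent in the idle_entry/idle_exit paths rather than in
 * the low-power state itself (integer percent, rounded down).
 */
static unsigned int idle_overhead_pct(unsigned int entry_exit_us,
				      unsigned int idle_call_us)
{
	return 100u * entry_exit_us / (entry_exit_us + idle_call_us);
}
```

With a 1000us idle_call, 10us of entry/exit overhead is under 1% of the episode; with netperf-style 2us idle calls, it is over 80%.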

>>>
>>>> +void cpuidle_entry_end(void)
>>>> +{
>>>> +	struct cpuidle_device *dev = cpuidle_get_device();
>>>> +	u64 overhead;
>>>> +	s64 diff;
>>>> +
>>>> +	if (dev) {
>>>> +		dev->idle_stat.entry_end = local_clock();
>>>> +		overhead = div_u64(dev->idle_stat.entry_end -
>>>> +				dev->idle_stat.entry_start, NSEC_PER_USEC);
>>>
>>> Is the conversion really necessary?
>>>
>>> If so, then why?
>>
>> We can choose between nanoseconds and microseconds. Given that the workload
>> results in the short idle pattern, I think microseconds are good enough for
>> real workloads.
>>
>> Another reason is that the prediction from the idle governor is in
>> microseconds, so I convert it for comparison purposes.
>>>
>>> And if there is a good reason, what about using right shift to do
>>> an approximate conversion to avoid the extra division here?
>>
>> Sure, >> 10 works for me, as I don't think precision is a big deal here.
>>
>>>
>>>> +		diff = overhead - dev->idle_stat.overhead;
>>>> +		dev->idle_stat.overhead += diff >> 3;
>>>
>>> Can you please explain what happens in the two lines above?
>>
>> Online average computing algorithm, stolen from update_avg() @ kernel/sched/core.c.
> 
> OK
> 
> Maybe care to add a comment to that effect?

Sure, I'll add in the next version.
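For reference, the two techniques discussed above -- the approximate ns->us conversion by right shift and the update_avg()-style running average -- can be sketched in plain C (a userspace sketch, not the kernel code itself):

```c
#include <assert.h>
#include <stdint.h>

/* Approximate ns->us conversion: divide by 1024 instead of 1000. */
static uint64_t approx_ns_to_us(uint64_t ns)
{
	return ns >> 10;
}

/*
 * Running average in the style of update_avg() in kernel/sched/core.c:
 * avg += (sample - avg) / 8, i.e. an exponentially weighted moving
 * average giving a 1/8 weight to each new sample.  Like the kernel,
 * this relies on arithmetic right shift for negative diffs.
 */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)sample - (int64_t)*avg;

	*avg += diff >> 3;
}
```

Starting from avg = 0, feeding in a 80us sample moves the average to 10, then a second 80us sample moves it to 18, converging toward the sample value.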

Thanks,
-Aubrey


* Re: [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality
  2017-10-17  0:07     ` Rafael J. Wysocki
@ 2017-10-17  7:32       ` Li, Aubrey
  0 siblings, 0 replies; 42+ messages in thread
From: Li, Aubrey @ 2017-10-17  7:32 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On 2017/10/17 8:07, Rafael J. Wysocki wrote:
> On Monday, October 16, 2017 9:44:41 AM CEST Li, Aubrey wrote:
>>
>> Or you concern why the threshold can't simply be tick interval?
> 
> That I guess.
> 
>> For the latter, if the threshold is close/equal to the tick, it's quite possible
>> the next event is the tick and no other else event.
> 
> Well, I don't quite get that.
> 
> What's the reasoning here?

If we set the threshold to the tick interval, then once the system keeps the
tick on, it never gets a chance to turn it off again.

Because with the tick on, and counting in the time of idle entry and idle
exit, the idle interval is always < the tick interval, which means every idle
is classified as a short idle.
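That feedback loop can be sketched numerically (a hypothetical helper, using the 4ms tick and ~10us entry/exit overhead from this discussion):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the lock-in: with the tick kept on, the next event is at
 * most one tick away, and entry/exit overhead shortens the observed
 * idle further, so the predicted interval never reaches a threshold
 * equal to the tick interval and every idle classifies as "fast".
 */
static bool is_fast_idle(unsigned int predicted_idle_us,
			 unsigned int threshold_us)
{
	return predicted_idle_us < threshold_us;
}
```

With threshold = tick = 4000us, the observable idle is at most 4000us minus the overhead, so the predicate is always true and the tick stays on.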

Thanks,
-Aubrey


* Re: [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality
  2017-09-30  7:20 [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality Aubrey Li
                   ` (8 preceding siblings ...)
  2017-10-14  1:14 ` [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality Rafael J. Wysocki
@ 2017-11-30  1:00 ` Li, Aubrey
  2017-11-30  1:37   ` Rafael J. Wysocki
  9 siblings, 1 reply; 42+ messages in thread
From: Li, Aubrey @ 2017-11-30  1:00 UTC (permalink / raw)
  To: Aubrey Li, tglx, peterz, rjw, len.brown, ak, tim.c.chen
  Cc: linux-pm, linux-kernel

Hi,

Thanks for Rafael's comments on v2. I'd like to ping here to see which
direction this proposal should go in and what fundamental changes it
should make.

I'm also open to any suggestions if my proposal is not on the right track.

Thanks,
-Aubrey

On 2017/9/30 15:20, Aubrey Li wrote:
> We found that under some latency-intensive workloads, short idle periods
> occur very commonly, and then the idle entry and exit paths start to
> dominate, so it's important to optimize them. To determine the short idle
> pattern, we need to figure out how long the coming idle will be and the
> threshold of the short idle interval.
> 
> A cpu idle prediction functionality is introduced in this proposal to catch
> the short idle pattern.
> 
> Firstly, we check the IRQ timings subsystem to see if there is an event
> coming soon.
> -- https://lwn.net/Articles/691297/
> 
> Secondly, we check the scheduler's idle statistics to see if it's likely
> we'll go into a short idle.
> -- https://patchwork.kernel.org/patch/2839221/
> 
> Thirdly, we predict the next idle interval by using the prediction
> functionality in the idle governor, if it has one.
> 
> For the threshold of the short idle interval, we record the timestamps of
> idle entry, and multiply by a tunable parameter here:
> -- /proc/sys/kernel/fast_idle_ratio
> 
> In this proposal we use the output of the idle prediction to skip turning
> the tick off if a short idle is determined. Reprogramming the hardware timer
> twice (off and on) is expensive for a very short idle. There are some
> potential optimizations that can be done based on the same indicator.
> 
> I observed that when the system is idle, the idle predictor reports 20
> long idles/s and ZERO fast idles on one CPU, and when the workload is
> running, it reports 72899 fast idles/s and ZERO long idles on the same CPU.
> 
> Aubrey Li (8):
>   cpuidle: menu: extract prediction functionality
>   cpuidle: record the overhead of idle entry
>   cpuidle: add a new predict interface
>   tick/nohz: keep tick on for a fast idle
>   timers: keep sleep length updated as needed
>   cpuidle: make fast idle threshold tunable
>   cpuidle: introduce irq timing to make idle prediction
>   cpuidle: introduce run queue average idle to make idle prediction
> 
>  drivers/cpuidle/Kconfig          |   1 +
>  drivers/cpuidle/cpuidle.c        | 109 +++++++++++++++++++++++++++++++++++++++
>  drivers/cpuidle/governors/menu.c |  69 ++++++++++++++++---------
>  include/linux/cpuidle.h          |  21 ++++++++
>  kernel/sched/idle.c              |  14 ++++-
>  kernel/sysctl.c                  |  12 +++++
>  kernel/time/tick-sched.c         |   7 +++
>  7 files changed, 209 insertions(+), 24 deletions(-)
> 


* Re: [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality
  2017-11-30  1:00 ` Li, Aubrey
@ 2017-11-30  1:37   ` Rafael J. Wysocki
  0 siblings, 0 replies; 42+ messages in thread
From: Rafael J. Wysocki @ 2017-11-30  1:37 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, Thomas Gleixner, Peter Zijlstra, Rafael J. Wysocki,
	Len Brown, Andi Kleen, Tim Chen, Linux PM,
	Linux Kernel Mailing List

On Thu, Nov 30, 2017 at 2:00 AM, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
> Hi,
>
> Thanks Rafael's comments against V2. I'd like to ping here to see which
> direction this proposal should go and what fundamental change this proposal
> should make.
>
> I'm also open to any suggestions if my proposal is not on the right way.

To me (and I discussed that with Peter briefly last month in Prague)
the way to go would be to avoid stopping the scheduler tick until we
are entirely sure that it should be stopped.  Which is when the
governor (any governor BTW, not just menu) chooses the target idle
state for the CPU and the target residency of that state is long
enough for the tick to be stopped ("long enough" here means that if
the tick is not stopped, the time spent by the CPU in the target idle
state will be shorter than the target residency).  Note that this is
totally independent on what governor is used and it doesn't involve
any "refinements" or "improvements" of the prediction whatever.

Of course, for the governor to make the state selection, it generally
will have to know when the next non-tick timer is going to expire and
getting that information to it will require some effort.
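The criterion described above could look roughly like this (a sketch under the stated assumptions; `should_stop_tick` and its parameters are illustrative, not an existing kernel interface):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Stop the tick only once the governor has picked a state whose target
 * residency exceeds the time to the next tick: if the tick stayed on,
 * it would wake the CPU before the state's residency pays off.
 */
static bool should_stop_tick(uint64_t target_residency_us,
			     uint64_t time_to_next_tick_us)
{
	return target_residency_us > time_to_next_tick_us;
}
```

For a state with 200us target residency and a tick 100us away, the tick is worth stopping; for a 50us-residency state it is not.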

Thanks,
Rafael


end of thread, other threads:[~2017-11-30  1:37 UTC | newest]

Thread overview: 42+ messages
2017-09-30  7:20 [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality Aubrey Li
2017-09-30  7:20 ` [RFC PATCH v2 1/8] cpuidle: menu: extract " Aubrey Li
2017-10-14  0:26   ` Rafael J. Wysocki
2017-10-16  2:46     ` Li, Aubrey
2017-09-30  7:20 ` [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry Aubrey Li
2017-10-14  0:35   ` Rafael J. Wysocki
2017-10-16  3:11     ` Li, Aubrey
2017-10-17  0:05       ` Rafael J. Wysocki
2017-10-17  7:04         ` Li, Aubrey
2017-09-30  7:20 ` [RFC PATCH v2 3/8] cpuidle: add a new predict interface Aubrey Li
2017-10-14  0:45   ` Rafael J. Wysocki
2017-10-16  8:04     ` Li, Aubrey
2017-10-14  1:27   ` Rafael J. Wysocki
2017-10-16  9:52     ` Li, Aubrey
2017-09-30  7:20 ` [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle Aubrey Li
2017-10-14  0:51   ` Rafael J. Wysocki
2017-10-16  3:26     ` Li, Aubrey
2017-10-16  4:45       ` Mike Galbraith
2017-10-16  5:34         ` Li, Aubrey
2017-10-16  6:25           ` Mike Galbraith
2017-10-16  6:31             ` Li, Aubrey
2017-09-30  7:20 ` [RFC PATCH v2 5/8] timers: keep sleep length updated as needed Aubrey Li
2017-10-14  0:56   ` Rafael J. Wysocki
2017-10-16  6:46     ` Li, Aubrey
2017-10-16 23:58       ` Rafael J. Wysocki
2017-10-17  6:10         ` Li, Aubrey
2017-09-30  7:20 ` [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable Aubrey Li
2017-10-14  0:59   ` Rafael J. Wysocki
2017-10-16  6:00     ` Li, Aubrey
2017-10-17  0:01       ` Rafael J. Wysocki
2017-10-17  6:12         ` Li, Aubrey
2017-09-30  7:20 ` [RFC PATCH v2 7/8] cpuidle: introduce irq timing to make idle prediction Aubrey Li
2017-10-14  1:01   ` Rafael J. Wysocki
2017-10-16  6:03     ` Li, Aubrey
2017-09-30  7:20 ` [RFC PATCH v2 8/8] cpuidle: introduce run queue average idle " Aubrey Li
2017-10-14  1:02   ` Rafael J. Wysocki
2017-10-14  1:14 ` [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality Rafael J. Wysocki
2017-10-16  7:44   ` Li, Aubrey
2017-10-17  0:07     ` Rafael J. Wysocki
2017-10-17  7:32       ` Li, Aubrey
2017-11-30  1:00 ` Li, Aubrey
2017-11-30  1:37   ` Rafael J. Wysocki
