* [RFC PATCH V2 00/19] Power Scheduler Design
@ 2014-08-11 11:31 Preeti U Murthy
  2014-08-11 11:32 ` [RFC PATCH V2 01/19] sched/power: Remove cpu idle state selection and cpu frequency tuning Preeti U Murthy
                   ` (15 more replies)
  0 siblings, 16 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:31 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

The new power aware scheduling framework is being designed with the goal of
having all cpu power management in one place. Today the power management
policies are fragmented between the cpuidle and cpufreq subsystems, which
makes power management inconsistent. To make matters worse, we were
integrating task packing algorithms into the scheduler, which could
potentially worsen the scenario.

The new power aware scheduler design will have all policies, all metrics and
all averaging concerning cpuidle and cpufreq in one place: the scheduler.
This patchset lays the foundation for this approach and helps remove the
existing fragmented approach towards cpu power savings.

NOTE: This patchset targets only cpuidle. cpufreq can be integrated into
this design along the same lines.

The design is broken down into incremental steps which will enable easy
validation of the power aware scheduler. This is by no means complete and
will require more work to get to a stage where it can beat the current
approach. Like I said, this is just the foundation to help us get started.
The subsequent patches can be small, incremental, measured steps.

Ingo pointed out this approach in http://lwn.net/Articles/552889/ and I
have tried my best to understand and implement the initial steps that
he suggested.

1.Start from the dumbest possible state: all CPUs are powered up fully;
there is essentially no idle state selection.

2.Then go for the biggest effect first and add the ability to idle in a
lower power state (with new functions and a low level driver that implements
this for the platform), with no policy embedded into it.

3.Implement the task packing algorithm.

This patchset implements the above three steps and makes the fundamental
design of the power aware scheduler clear. Specifically:

1.The design should be non-intrusive to the existing code. It should be
enabled/disabled by a config switch. This way we can continue to work towards
making it better without having to worry about regressing the kernel, and
yet have it in the kernel at the same time; a confidence booster that it is
making headway.
  CONFIG_SCHED_POWER is the switch that makes the new code appear when turned
on, and disappear, falling back to the original code, when turned off.
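
As an illustration of what "appear/disappear" means in practice (the real
hunks are in the patches that follow; the hook below is a made-up name used
only to show the pattern), new code is built and called only behind the
switch:

  /* kernel/sched/Makefile builds the new code only when the switch is on:
   *     obj-$(CONFIG_SCHED_POWER) += power.o
   */

  #ifdef CONFIG_SCHED_POWER
  extern void sched_power_hook(struct rq *rq);          /* hypothetical hook */
  #else
  static inline void sched_power_hook(struct rq *rq) { } /* compiles away */
  #endif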

2.The design should help us test it better. As Ingo pointed out:

"Important: it's not a problem that the initial code won't outperform the 
current kernel's performance. It should outperform the _initial_ 'dumb'
code in the first step. Then the next step should outperform the previous 
step, etc.
  The quality of this iterative approach will eventually surpass the 
combined effect of currently available but non-integrated facilities."

This is precisely what this design does. PATCH[1/19] disables the cpuidle and
cpufreq subsystems altogether if CONFIG_SCHED_POWER is enabled.
This is the dumb code. Our subsequent patches should outperform this.

3. Introduce a low level driver which interfaces the scheduler with C-state
switching. Again, Ingo pointed this out, saying:
"It should be presented to the scheduler in a platform independent fashion,
but without policy embedded: a low level platform driver interface in essence."

PATCH[2/19] ensures that cpuidle governors no longer control
idle state selection. The idle state selection and policies are moved into
kernel/sched/power.c. True, it's the same code as the menu governor, however
it has been moved into scheduler specific code and no longer functions like
a driver; it is meant to be part of the core kernel. The "low level driver"
lives under drivers/cpuidle/cpuidle.c like before. It registers platform
specific cpuidle drivers and does other low level work that the scheduler
needn't bother about. It has no policies embedded in it whatsoever.
Importantly, it is an entry point to switching C-states and nothing beyond
that.
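
In code terms, the split in PATCH[2/19] boils down to the scheduler-facing
entry points delegating to kernel/sched/power.c. A condensed sketch of the
CONFIG_SCHED_POWER variants from that patch (see the diff below for the
complete picture, including the old governor path):

  /* drivers/cpuidle/cpuidle.c: entry points only, no policy */
  static int __cpuidle_select(struct cpuidle_driver *drv,
                              struct cpuidle_device *dev)
  {
          /* policy and heuristics live in kernel/sched/power.c */
          return cpuidle_sched_select(drv, dev);
  }

  static void __cpuidle_reflect(struct cpuidle_device *dev, int index)
  {
          /* feed the outcome back into the scheduler's statistics */
          cpuidle_sched_reflect(dev, index);
  }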

PATCH[3/19] enumerates idle states and their parameters in the scheduler
topology. This is so that the scheduler knows the cost of entry into and exit
from idle states, which can be made use of going ahead. As an example, this
patchset shows how the platform specific cpuidle driver should fill in the
idle state details in the topology. This fundamental information is missing
in the scheduler today.
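
For instance, once the topology levels carry the idle states, the scheduler
could weigh the wakeup cost of a group with something along the lines of the
helper below. This is purely hypothetical (nothing in this series uses it
yet) and only illustrates the kind of query that becomes possible:

  /* Hypothetical: is it worth waking a cpu whose group sits in state @idx? */
  static bool worth_waking(struct sched_domain_topology_level *tl, int idx,
                           unsigned int expected_busy_us)
  {
          struct cpuidle_state *s = &tl->states[idx];

          /* target_residency/exit_latency are filled in by the platform driver */
          return expected_busy_us > s->target_residency + s->exit_latency;
  }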

These two patches are not expected to change the performance/power savings
in any way. They are just the first steps towards the integrated approach of
the power aware scheduler.

The patches PATCH[4/19] to PATCH[18/19] do task packing. This series is the
one that Alex Shi posted long ago: https://lkml.org/lkml/2013/3/30/78.
However, this series will come into effect only if CONFIG_SCHED_POWER is
enabled. It is this series which is expected to bring about changes in
performance and power savings; not necessarily better than the existing code,
but certainly better than the dumb code.

Our subsequent efforts should surpass the performance/power savings of the
existing code. This patch series is compile tested only.

V1 of this power efficient scheduling design was posted by Morten after
Ingo posted his suggestions in http://lwn.net/Articles/552889/:
[RFC][PATCH 0/9] sched: Power scheduler design proposal:
https://lkml.org/lkml/2013/7/15/101
However, it decoupled the scheduler into a regular and a power scheduler,
with the latter controlling the cpus that could be used by the regular
scheduler. We do not need this kind of decoupling. With the foundation that
this patch set lays, it should be relatively easy to make the existing
scheduler power aware.

---

Alex Shi (16):
      sched: add sched balance policies in kernel
      sched: add sysfs interface for sched_balance_policy selection
      sched: log the cpu utilization at rq
      sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing
      sched: move sg/sd_lb_stats struct ahead
      sched: get rq potential maximum utilization
      sched: detect wakeup burst with rq->avg_idle
      sched: add power aware scheduling in fork/exec/wake
      sched: using avg_idle to detect bursty wakeup
      sched: packing transitory tasks in wakeup power balancing
      sched: add power/performance balance allow flag
      sched: pull all tasks from source grp and no balance for prefer_sibling
      sched: add new members of sd_lb_stats
      sched: power aware load balance
      sched: lazy power balance
      sched: don't do power balance on share cpu power domain

Preeti U Murthy (3):
      sched/power: Remove cpu idle state selection and cpu frequency tuning
      sched/power: Move idle state selection into the scheduler
      sched/idle: Enumerate idle states in scheduler topology


 Documentation/ABI/testing/sysfs-devices-system-cpu |   23 +
 arch/powerpc/Kconfig                               |    1 
 arch/powerpc/platforms/powernv/Kconfig             |   12 
 drivers/cpufreq/Kconfig                            |    2 
 drivers/cpuidle/Kconfig                            |   10 
 drivers/cpuidle/cpuidle-powernv.c                  |   10 
 drivers/cpuidle/cpuidle.c                          |   65 ++
 include/linux/sched.h                              |   16 -
 include/linux/sched/sysctl.h                       |    3 
 kernel/Kconfig.sched                               |   11 
 kernel/sched/Makefile                              |    1 
 kernel/sched/debug.c                               |    3 
 kernel/sched/fair.c                                |  632 +++++++++++++++++++-
 kernel/sched/power.c                               |  480 +++++++++++++++
 kernel/sched/sched.h                               |   16 +
 kernel/sysctl.c                                    |    9 
 16 files changed, 1234 insertions(+), 60 deletions(-)
 create mode 100644 kernel/Kconfig.sched
 create mode 100644 kernel/sched/power.c

--



* [RFC PATCH V2 01/19] sched/power: Remove cpu idle state selection and cpu frequency tuning
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
@ 2014-08-11 11:32 ` Preeti U Murthy
  2014-08-18 15:39   ` Nicolas Pitre
  2014-08-11 11:33 ` [RFC PATCH V2 02/19] sched/power: Move idle state selection into the scheduler Preeti U Murthy
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:32 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

As a first step towards improving the power awareness of the scheduler,
this patch enables a "dumb" state where all power management is turned off.
Whatever we additionally put into the kernel for cpu power management must
do better than this in terms of both performance and power savings.
This will enable us to benchmark and optimize the power aware scheduler
from scratch. If we were to benchmark it against the performance of the
existing design, we would get sufficiently distracted by the performance
numbers and be steered away from a sane design.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/Kconfig                   |    1 +
 arch/powerpc/platforms/powernv/Kconfig |   12 ++++++------
 drivers/cpufreq/Kconfig                |    2 ++
 drivers/cpuidle/Kconfig                |    2 ++
 kernel/Kconfig.sched                   |   11 +++++++++++
 5 files changed, 22 insertions(+), 6 deletions(-)
 create mode 100644 kernel/Kconfig.sched

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 80b94b0..b7fe36a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -301,6 +301,7 @@ config HIGHMEM
 
 source kernel/Kconfig.hz
 source kernel/Kconfig.preempt
+source kernel/Kconfig.sched
 source "fs/Kconfig.binfmt"
 
 config HUGETLB_PAGE_SIZE_VARIABLE
diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index 45a8ed0..b0ef8b1 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -11,12 +11,12 @@ config PPC_POWERNV
 	select PPC_UDBG_16550
 	select PPC_SCOM
 	select ARCH_RANDOM
-	select CPU_FREQ
-	select CPU_FREQ_GOV_PERFORMANCE
-	select CPU_FREQ_GOV_POWERSAVE
-	select CPU_FREQ_GOV_USERSPACE
-	select CPU_FREQ_GOV_ONDEMAND
-	select CPU_FREQ_GOV_CONSERVATIVE
+	select CPU_FREQ if !SCHED_POWER
+	select CPU_FREQ_GOV_PERFORMANCE if CPU_FREQ
+	select CPU_FREQ_GOV_POWERSAVE if CPU_FREQ
+	select CPU_FREQ_GOV_USERSPACE if CPU_FREQ
+	select CPU_FREQ_GOV_ONDEMAND if CPU_FREQ
+	select CPU_FREQ_GOV_CONSERVATIVE if CPU_FREQ
 	select PPC_DOORBELL
 	default y
 
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index ffe350f..8976fd6 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -2,6 +2,7 @@ menu "CPU Frequency scaling"
 
 config CPU_FREQ
 	bool "CPU Frequency scaling"
+	depends on !SCHED_POWER
 	help
 	  CPU Frequency scaling allows you to change the clock speed of 
 	  CPUs on the fly. This is a nice method to save power, because 
@@ -12,6 +13,7 @@ config CPU_FREQ
 	  (see below) after boot, or use a userspace tool.
 
 	  For details, take a look at <file:Documentation/cpu-freq>.
+	  This feature will turn off if power aware scheduling is enabled.
 
 	  If in doubt, say N.
 
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 32748c3..2c4ac79 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -3,6 +3,7 @@ menu "CPU Idle"
 config CPU_IDLE
 	bool "CPU idle PM support"
 	default y if ACPI || PPC_PSERIES
+	depends on !SCHED_POWER
 	select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
 	select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE)
 	help
@@ -11,6 +12,7 @@ config CPU_IDLE
 	  governors that can be swapped during runtime.
 
 	  If you're using an ACPI-enabled platform, you should say Y here.
+	  This feature will turn off if power aware scheduling is enabled.
 
 if CPU_IDLE
 
diff --git a/kernel/Kconfig.sched b/kernel/Kconfig.sched
new file mode 100644
index 0000000..374454c
--- /dev/null
+++ b/kernel/Kconfig.sched
@@ -0,0 +1,11 @@
+menu "Power Aware Scheduling"
+
+config SCHED_POWER
+        bool "Power Aware Scheduler"
+        default n
+        help
+           Select this to enable the new power aware scheduler.
+endmenu
+
+
+



* [RFC PATCH V2 02/19] sched/power: Move idle state selection into the scheduler
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
  2014-08-11 11:32 ` [RFC PATCH V2 01/19] sched/power: Remove cpu idle state selection and cpu frequency tuning Preeti U Murthy
@ 2014-08-11 11:33 ` Preeti U Murthy
  2014-08-18 15:54   ` Nicolas Pitre
  2014-08-11 11:33 ` [RFC PATCH V2 03/19] sched/idle: Enumerate idle states in scheduler topology Preeti U Murthy
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:33 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

The goal of the power aware scheduling design is to integrate all
policy, metrics and averaging into the scheduler. Today the
cpu power management is fragmented and hence inconsistent.

As a first step towards this integration, rid the cpuidle state management
of the governors. Retain only the cpuidle driver in the cpuidle
subsystem, which acts as an interface between the scheduler and the low
level platform specific cpuidle drivers. For all decision making around
selection of idle states, the cpuidle driver falls back to the scheduler.

The current algorithm for idle state selection is the same as the logic used
by the menu governor. However, going ahead, the heuristics will be tuned and
improved upon with metrics better known to the scheduler.

Note: cpufreq is still left disabled when CONFIG_SCHED_POWER is selected.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 drivers/cpuidle/Kconfig           |   12 +
 drivers/cpuidle/cpuidle-powernv.c |    2 
 drivers/cpuidle/cpuidle.c         |   65 ++++-
 include/linux/sched.h             |    9 +
 kernel/sched/Makefile             |    1 
 kernel/sched/power.c              |  480 +++++++++++++++++++++++++++++++++++++
 6 files changed, 554 insertions(+), 15 deletions(-)
 create mode 100644 kernel/sched/power.c

diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 2c4ac79..4fa4cb1 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -3,16 +3,14 @@ menu "CPU Idle"
 config CPU_IDLE
 	bool "CPU idle PM support"
 	default y if ACPI || PPC_PSERIES
-	depends on !SCHED_POWER
-	select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
-	select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE)
+	select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE && !SCHED_POWER)
+	select CPU_IDLE_GOV_MENU if ((NO_HZ || NO_HZ_IDLE) && !SCHED_POWER)
 	help
 	  CPU idle is a generic framework for supporting software-controlled
 	  idle processor power management.  It includes modular cross-platform
 	  governors that can be swapped during runtime.
 
 	  If you're using an ACPI-enabled platform, you should say Y here.
-	  This feature will turn off if power aware scheduling is enabled.
 
 if CPU_IDLE
 
@@ -22,10 +20,16 @@ config CPU_IDLE_MULTIPLE_DRIVERS
 config CPU_IDLE_GOV_LADDER
 	bool "Ladder governor (for periodic timer tick)"
 	default y
+	depends on !SCHED_POWER
+	help
+	  This feature will turn off if power aware scheduling is enabled.
 
 config CPU_IDLE_GOV_MENU
 	bool "Menu governor (for tickless system)"
 	default y
+	depends on !SCHED_POWER
+	help
+	  This feature will turn off if power aware scheduling is enabled.
 
 menu "ARM CPU Idle Drivers"
 depends on ARM
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index fa79392..95ef533 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -70,7 +70,7 @@ static int fastsleep_loop(struct cpuidle_device *dev,
 	unsigned long new_lpcr;
 
 	if (powersave_nap < 2)
-		return;
+		return 0;
 	if (unlikely(system_state < SYSTEM_RUNNING))
 		return index;
 
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index ee9df5e..38fb213 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -150,6 +150,19 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 	return entered_state;
 }
 
+#ifdef CONFIG_SCHED_POWER
+static int __cpuidle_select(struct cpuidle_driver *drv,
+				struct cpuidle_device *dev)
+{
+	return cpuidle_sched_select(drv, dev);
+}
+#else
+static int __cpuidle_select(struct cpuidle_driver *drv,
+				struct cpuidle_device *dev)
+{
+	return cpuidle_curr_governor->select(drv, dev);	
+}
+#endif
 /**
  * cpuidle_select - ask the cpuidle framework to choose an idle state
  *
@@ -169,7 +182,7 @@ int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	if (unlikely(use_deepest_state))
 		return cpuidle_find_deepest_state(drv, dev);
 
-	return cpuidle_curr_governor->select(drv, dev);
+	return __cpuidle_select(drv, dev);
 }
 
 /**
@@ -190,6 +203,18 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	return cpuidle_enter_state(dev, drv, index);
 }
 
+#ifdef CONFIG_SCHED_POWER
+static void __cpuidle_reflect(struct cpuidle_device *dev, int index)
+{
+	cpuidle_sched_reflect(dev, index);
+}
+#else
+static void __cpuidle_reflect(struct cpuidle_device *dev, int index)
+{
+	if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state))
+		cpuidle_curr_governor->reflect(dev, index);
+}
+#endif
 /**
  * cpuidle_reflect - tell the underlying governor what was the state
  * we were in
@@ -200,8 +225,7 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev,
  */
 void cpuidle_reflect(struct cpuidle_device *dev, int index)
 {
-	if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state))
-		cpuidle_curr_governor->reflect(dev, index);
+	__cpuidle_reflect(dev, index);
 }
 
 /**
@@ -265,6 +289,28 @@ void cpuidle_resume(void)
 	mutex_unlock(&cpuidle_lock);
 }
 
+#ifdef CONFIG_SCHED_POWER
+static int cpuidle_check_governor(struct cpuidle_driver *drv,
+					struct cpuidle_device *dev, int enable)
+{
+	if (enable)
+		return cpuidle_sched_enable_device(drv, dev);
+	else
+		return 0;
+}
+#else
+static int cpuidle_check_governor(struct cpuidle_driver *drv,
+					struct cpuidle_device *dev, int enable)
+{
+	if (!cpuidle_curr_governor)
+		return -EIO;
+
+	if (enable && cpuidle_curr_governor->enable)
+		return cpuidle_curr_governor->enable(drv, dev);
+	else if (cpuidle_curr_governor->disable)
+		cpuidle_curr_governor->disable(drv, dev);
+}
+#endif
 /**
  * cpuidle_enable_device - enables idle PM for a CPU
  * @dev: the CPU
@@ -285,7 +331,7 @@ int cpuidle_enable_device(struct cpuidle_device *dev)
 
 	drv = cpuidle_get_cpu_driver(dev);
 
-	if (!drv || !cpuidle_curr_governor)
+	if (!drv)
 		return -EIO;
 
 	if (!dev->registered)
@@ -298,8 +344,8 @@ int cpuidle_enable_device(struct cpuidle_device *dev)
 	if (ret)
 		return ret;
 
-	if (cpuidle_curr_governor->enable &&
-	    (ret = cpuidle_curr_governor->enable(drv, dev)))
+	ret = cpuidle_check_governor(drv, dev, 1);
+	if (ret)
 		goto fail_sysfs;
 
 	smp_wmb();
@@ -331,13 +377,12 @@ void cpuidle_disable_device(struct cpuidle_device *dev)
 	if (!dev || !dev->enabled)
 		return;
 
-	if (!drv || !cpuidle_curr_governor)
+	if (!drv)
 		return;
-
+	
 	dev->enabled = 0;
 
-	if (cpuidle_curr_governor->disable)
-		cpuidle_curr_governor->disable(drv, dev);
+	cpuidle_check_governor(drv, dev, 0);
 
 	cpuidle_remove_device_sysfs(dev);
 	enabled_devices--;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7c19d55..5dd99b5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -26,6 +26,7 @@ struct sched_param {
 #include <linux/nodemask.h>
 #include <linux/mm_types.h>
 #include <linux/preempt_mask.h>
+#include <linux/cpuidle.h>
 
 #include <asm/page.h>
 #include <asm/ptrace.h>
@@ -846,6 +847,14 @@ enum cpu_idle_type {
 	CPU_MAX_IDLE_TYPES
 };
 
+#ifdef CONFIG_SCHED_POWER
+extern void cpuidle_sched_reflect(struct cpuidle_device *dev, int index);
+extern int cpuidle_sched_select(struct cpuidle_driver *drv,
+					struct cpuidle_device *dev);
+extern int cpuidle_sched_enable_device(struct cpuidle_driver *drv,
+						struct cpuidle_device *dev);
+#endif
+
 /*
  * Increase resolution of cpu_capacity calculations
  */
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index ab32b7b..5b8e469 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_SCHED_POWER) += power.o
diff --git a/kernel/sched/power.c b/kernel/sched/power.c
new file mode 100644
index 0000000..63c9276
--- /dev/null
+++ b/kernel/sched/power.c
@@ -0,0 +1,480 @@
+/*
+ * power.c - the power aware scheduler
+ *
+ * Author:
+ *        Preeti U. Murthy <preeti@linux.vnet.ibm.com>
+ *
+ * This code is a replica of drivers/cpuidle/governors/menu.c
+ * To make the transition to power aware scheduler away from
+ * the cpuidle governor model easy, we do exactly what the
+ * governors do for now. Going ahead the heuristics will be
+ * tuned and improved upon.
+ *
+ * This code is licenced under the GPL version 2 as described
+ * in the COPYING file that accompanies the Linux Kernel.
+ */
+
+#include <linux/kernel.h>
+#include <linux/cpuidle.h>
+#include <linux/pm_qos.h>
+#include <linux/time.h>
+#include <linux/ktime.h>
+#include <linux/hrtimer.h>
+#include <linux/tick.h>
+#include <linux/sched.h>
+#include <linux/math64.h>
+#include <linux/module.h>
+
+/*
+ * Please note when changing the tuning values:
+ * If (MAX_INTERESTING-1) * RESOLUTION > UINT_MAX, the result of
+ * a scaling operation multiplication may overflow on 32 bit platforms.
+ * In that case, #define RESOLUTION as ULL to get 64 bit result:
+ * #define RESOLUTION 1024ULL
+ *
+ * The default values do not overflow.
+ */
+#define BUCKETS 12
+#define INTERVALS 8
+#define RESOLUTION 1024
+#define DECAY 8
+#define MAX_INTERESTING 50000
+
+
+/*
+ * Concepts and ideas behind the power aware scheduler
+ *
+ * For the power aware scheduler, there are 3 decision factors for picking a C
+ * state:
+ * 1) Energy break even point
+ * 2) Performance impact
+ * 3) Latency tolerance (from pmqos infrastructure)
+ * These three factors are treated independently.
+ *
+ * Energy break even point
+ * -----------------------
+ * C state entry and exit have an energy cost, and a certain amount of time in
+ * the  C state is required to actually break even on this cost. CPUIDLE
+ * provides us this duration in the "target_residency" field. So all that we
+ * need is a good prediction of how long we'll be idle. Like the traditional
+ * governors, we start with the actual known "next timer event" time.
+ *
+ * Since there are other source of wakeups (interrupts for example) than
+ * the next timer event, this estimation is rather optimistic. To get a
+ * more realistic estimate, a correction factor is applied to the estimate,
+ * that is based on historic behavior. For example, if in the past the actual
+ * duration always was 50% of the next timer tick, the correction factor will
+ * be 0.5.
+ *
+ * power aware scheduler uses a running average for this correction factor,
+ * however it uses a set of factors, not just a single factor. This stems from
+ * the realization that the ratio is dependent on the order of magnitude of the
+ * expected duration; if we expect 500 milliseconds of idle time the likelihood of
+ * getting an interrupt very early is much higher than if we expect 50 micro
+ * seconds of idle time. A second independent factor that has big impact on
+ * the actual factor is if there is (disk) IO outstanding or not.
+ * (as a special twist, we consider every sleep longer than 50 milliseconds
+ * as perfect; there are no power gains for sleeping longer than this)
+ *
+ * For these two reasons we keep an array of 12 independent factors, that gets
+ * indexed based on the magnitude of the expected duration as well as the
+ * "is IO outstanding" property.
+ *
+ * Repeatable-interval-detector
+ * ----------------------------
+ * There are some cases where "next timer" is a completely unusable predictor:
+ * Those cases where the interval is fixed, for example due to hardware
+ * interrupt mitigation, but also due to fixed transfer rate devices such as
+ * mice.
+ * For this, we use a different predictor: We track the duration of the last 8
+ * intervals and if the standard deviation of these 8 intervals is below a
+ * threshold value, we use the average of these intervals as prediction.
+ *
+ * Limiting Performance Impact
+ * ---------------------------
+ * C states, especially those with large exit latencies, can have a real
+ * noticeable impact on workloads, which is not acceptable for most sysadmins,
+ * and in addition, less performance has a power price of its own.
+ *
+ * As a general rule of thumb, power aware sched assumes that the following
+ * heuristic holds:
+ *     The busier the system, the less impact of C states is acceptable
+ *
+ * This rule-of-thumb is implemented using a performance-multiplier:
+ * If the exit latency times the performance multiplier is longer than
+ * the predicted duration, the C state is not considered a candidate
+ * for selection due to a too high performance impact. So the higher
+ * this multiplier is, the longer we need to be idle to pick a deep C
+ * state, and thus the less likely a busy CPU will hit such a deep
+ * C state.
+ *
+ * Two factors are used in determining this multiplier:
+ * a value of 10 is added for each point of "per cpu load average" we have.
+ * a value of 5 points is added for each process that is waiting for
+ * IO on this CPU.
+ * (these values are experimentally determined)
+ *
+ * The load average factor gives a longer term (few seconds) input to the
+ * decision, while the iowait value gives a cpu local instantaneous input.
+ * The iowait factor may look low, but realize that this is also already
+ * represented in the system load average.
+ *
+ */
+
+struct sched_cpuidle_info {
+	int		last_state_idx;
+	int             needs_update;
+
+	unsigned int	next_timer_us;
+	unsigned int	predicted_us;
+	unsigned int	bucket;
+	unsigned int	correction_factor[BUCKETS];
+	unsigned int	intervals[INTERVALS];
+	int		interval_ptr;
+};
+
+
+#define LOAD_INT(x) ((x) >> FSHIFT)
+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
+
+static int get_loadavg(void)
+{
+	unsigned long this = this_cpu_load();
+
+
+	return LOAD_INT(this) * 10 + LOAD_FRAC(this) / 10;
+}
+
+static inline int which_bucket(unsigned int duration)
+{
+	int bucket = 0;
+
+	/*
+	 * We keep two groups of stats; one with no
+	 * IO pending, one without.
+	 * This allows us to calculate
+	 * E(duration)|iowait
+	 */
+	if (nr_iowait_cpu(smp_processor_id()))
+		bucket = BUCKETS/2;
+
+	if (duration < 10)
+		return bucket;
+	if (duration < 100)
+		return bucket + 1;
+	if (duration < 1000)
+		return bucket + 2;
+	if (duration < 10000)
+		return bucket + 3;
+	if (duration < 100000)
+		return bucket + 4;
+	return bucket + 5;
+}
+
+/*
+ * Return a multiplier for the exit latency that is intended
+ * to take performance requirements into account.
+ * The more performance critical we estimate the system
+ * to be, the higher this multiplier, and thus the higher
+ * the barrier to go to an expensive C state.
+ */
+static inline int performance_multiplier(void)
+{
+	int mult = 1;
+
+	/* for higher loadavg, we are more reluctant */
+
+	mult += 2 * get_loadavg();
+
+	/* for IO wait tasks (per cpu!) we add 5x each */
+	mult += 10 * nr_iowait_cpu(smp_processor_id());
+
+	return mult;
+}
+
+static DEFINE_PER_CPU(struct sched_cpuidle_info, cpuidle_info );
+
+static void cpuidle_sched_update(struct cpuidle_driver *drv,
+					struct cpuidle_device *dev);
+
+/* This implements DIV_ROUND_CLOSEST but avoids 64 bit division */
+static u64 div_round64(u64 dividend, u32 divisor)
+{
+	return div_u64(dividend + (divisor / 2), divisor);
+}
+
+/*
+ * Try detecting repeating patterns by keeping track of the last 8
+ * intervals, and checking if the standard deviation of that set
+ * of points is below a threshold. If it is... then use the
+ * average of these 8 points as the estimated value.
+ */
+static void get_typical_interval(struct sched_cpuidle_info *data)
+{
+	int i, divisor;
+	unsigned int max, thresh;
+	uint64_t avg, stddev;
+
+	thresh = UINT_MAX; /* Discard outliers above this value */
+
+again:
+
+	/* First calculate the average of past intervals */
+	max = 0;
+	avg = 0;
+	divisor = 0;
+	for (i = 0; i < INTERVALS; i++) {
+		unsigned int value = data->intervals[i];
+		if (value <= thresh) {
+			avg += value;
+			divisor++;
+			if (value > max)
+				max = value;
+		}
+	}
+	do_div(avg, divisor);
+
+	/* Then try to determine standard deviation */
+	stddev = 0;
+	for (i = 0; i < INTERVALS; i++) {
+		unsigned int value = data->intervals[i];
+		if (value <= thresh) {
+			int64_t diff = value - avg;
+			stddev += diff * diff;
+		}
+	}
+	do_div(stddev, divisor);
+	/*
+	 * The typical interval is obtained when standard deviation is small
+	 * or standard deviation is small compared to the average interval.
+	 *
+	 * int_sqrt() formal parameter type is unsigned long. When the
+	 * greatest difference to an outlier exceeds ~65 ms * sqrt(divisor)
+	 * the resulting squared standard deviation exceeds the input domain
+	 * of int_sqrt on platforms where unsigned long is 32 bits in size.
+	 * In such case reject the candidate average.
+	 *
+	 * Use this result only if there is no timer to wake us up sooner.
+	 */
+	if (likely(stddev <= ULONG_MAX)) {
+		stddev = int_sqrt(stddev);
+		if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
+							|| stddev <= 20) {
+			if (data->next_timer_us > avg)
+				data->predicted_us = avg;
+			return;
+		}
+	}
+
+	/*
+	 * If we have outliers to the upside in our distribution, discard
+	 * those by setting the threshold to exclude these outliers, then
+	 * calculate the average and standard deviation again. Once we get
+	 * down to the bottom 3/4 of our samples, stop excluding samples.
+	 *
+	 * This can deal with workloads that have long pauses interspersed
+	 * with sporadic activity with a bunch of short pauses.
+	 */
+	if ((divisor * 4) <= INTERVALS * 3)
+		return;
+
+	thresh = max - 1;
+	goto again;
+}
+
+/**
+ * cpuidle_sched_select - selects the next idle state to enter
+ * @drv: cpuidle driver containing state data
+ * @dev: the CPU
+ */
+int cpuidle_sched_select(struct cpuidle_driver *drv,
+				struct cpuidle_device *dev)
+{
+	struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
+	int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
+	int i;
+	unsigned int interactivity_req;
+	struct timespec t;
+
+	if (data->needs_update) {
+		cpuidle_sched_update(drv, dev);
+		data->needs_update = 0;
+	}
+
+	data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
+
+	/* Special case when user has set very strict latency requirement */
+	if (unlikely(latency_req == 0))
+		return 0;
+
+	/* determine the expected residency time, round up */
+	t = ktime_to_timespec(tick_nohz_get_sleep_length());
+	data->next_timer_us =
+		t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
+
+
+	data->bucket = which_bucket(data->next_timer_us);
+
+	/*
+	 * Force the result of multiplication to be 64 bits even if both
+	 * operands are 32 bits.
+	 * Make sure to round up for half microseconds.
+	 */
+	data->predicted_us = div_round64((uint64_t)data->next_timer_us *
+					 data->correction_factor[data->bucket],
+					 RESOLUTION * DECAY);
+
+	get_typical_interval(data);
+
+	/*
+	 * Performance multiplier defines a minimum predicted idle
+	 * duration / latency ratio. Adjust the latency limit if
+	 * necessary.
+	 */
+	interactivity_req = data->predicted_us / performance_multiplier();
+	if (latency_req > interactivity_req)
+		latency_req = interactivity_req;
+
+	/*
+	 * We want to default to C1 (hlt), not to busy polling
+	 * unless the timer is happening really really soon.
+	 */
+	if (data->next_timer_us > 5 &&
+	    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
+		dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
+		data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
+
+	/*
+	 * Find the idle state with the lowest power while satisfying
+	 * our constraints.
+	 */
+	for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
+		struct cpuidle_state *s = &drv->states[i];
+		struct cpuidle_state_usage *su = &dev->states_usage[i];
+
+		if (s->disabled || su->disable)
+			continue;
+		if (s->target_residency > data->predicted_us)
+			continue;
+		if (s->exit_latency > latency_req)
+			continue;
+
+		data->last_state_idx = i;
+	}
+
+	return data->last_state_idx;
+}
+
+/**
+ * cpuidle_sched_reflect - records that data structures need update
+ * @dev: the CPU
+ * @index: the index of actual entered state
+ *
+ * NOTE: it's important to be fast here because this operation will add to
+ *       the overall exit latency.
+ */
+void cpuidle_sched_reflect(struct cpuidle_device *dev, int index)
+{
+	struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
+	data->last_state_idx = index;
+	if (index >= 0)
+		data->needs_update = 1;
+}
+
+/**
+ * cpuidle_sched_update - attempts to guess what happened after entry
+ * @drv: cpuidle driver containing state data
+ * @dev: the CPU
+ */
+static void cpuidle_sched_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+{
+	struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
+	int last_idx = data->last_state_idx;
+	struct cpuidle_state *target = &drv->states[last_idx];
+	unsigned int measured_us;
+	unsigned int new_factor;
+
+	/*
+	 * Try to figure out how much time passed between entry to low
+	 * power state and occurrence of the wakeup event.
+	 *
+	 * If the entered idle state didn't support residency measurements,
+	 * we are basically lost in the dark how much time passed.
+	 * As a compromise, assume we slept for the whole expected time.
+	 *
+	 * Any measured amount of time will include the exit latency.
+	 * Since we are interested in when the wakeup begun, not when it
+	 * was completed, we must subtract the exit latency. However, if
+	 * the measured amount of time is less than the exit latency,
+	 * assume the state was never reached and the exit latency is 0.
+	 */
+	if (unlikely(!(target->flags & CPUIDLE_FLAG_TIME_VALID))) {
+		/* Use timer value as is */
+		measured_us = data->next_timer_us;
+
+	} else {
+		/* Use measured value */
+		measured_us = cpuidle_get_last_residency(dev);
+
+		/* Deduct exit latency */
+		if (measured_us > target->exit_latency)
+			measured_us -= target->exit_latency;
+
+		/* Make sure our coefficients do not exceed unity */
+		if (measured_us > data->next_timer_us)
+			measured_us = data->next_timer_us;
+	}
+
+	/* Update our correction ratio */
+	new_factor = data->correction_factor[data->bucket];
+	new_factor -= new_factor / DECAY;
+
+	if (data->next_timer_us > 0 && measured_us < MAX_INTERESTING)
+		new_factor += RESOLUTION * measured_us / data->next_timer_us;
+	else
+		/*
+		 * we were idle so long that we count it as a perfect
+		 * prediction
+		 */
+		new_factor += RESOLUTION;
+
+	/*
+	 * We don't want 0 as factor; we always want at least
+	 * a tiny bit of estimated time. Fortunately, due to rounding,
+	 * new_factor will stay nonzero regardless of measured_us values
+	 * and the compiler can eliminate this test as long as DECAY > 1.
+	 */
+	if (DECAY == 1 && unlikely(new_factor == 0))
+		new_factor = 1;
+
+	data->correction_factor[data->bucket] = new_factor;
+
+	/* update the repeating-pattern data */
+	data->intervals[data->interval_ptr++] = measured_us;
+	if (data->interval_ptr >= INTERVALS)
+		data->interval_ptr = 0;
+}
+
+/**
+ * cpuidle_sched_enable_device - scans a CPU's states and does setup
+ * @drv: cpuidle driver
+ * @dev: the CPU
+ */
+int cpuidle_sched_enable_device(struct cpuidle_driver *drv,
+				struct cpuidle_device *dev)
+{
+	struct sched_cpuidle_info *data = &per_cpu(cpuidle_info, dev->cpu);
+	int i;
+
+	memset(data, 0, sizeof(struct sched_cpuidle_info));
+
+	/*
+	 * if the correction factor is 0 (eg first time init or cpu hotplug
+	 * etc), we actually want to start out with a unity factor.
+	 */
+	for(i = 0; i < BUCKETS; i++)
+		data->correction_factor[i] = RESOLUTION * DECAY;
+
+	return 0;
+}
+



* [RFC PATCH V2 03/19] sched/idle: Enumerate idle states in scheduler topology
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
  2014-08-11 11:32 ` [RFC PATCH V2 01/19] sched/power: Remove cpu idle state selection and cpu frequency tuning Preeti U Murthy
  2014-08-11 11:33 ` [RFC PATCH V2 02/19] sched/power: Move idle state selection into the scheduler Preeti U Murthy
@ 2014-08-11 11:33 ` Preeti U Murthy
  2014-08-11 11:34 ` [RFC PATCH V2 04/19] sched: add sched balance policies in kernel Preeti U Murthy
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:33 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

The goal of the power aware scheduler design is to integrate
all cpu power management into the scheduler. As a first step
the idle state selection was moved into the scheduler. Doing
this helps better decide which idle state to enter into using
metrics known to the scheduler. Furthermore, the cost of entering
and exiting an idle state can help the scheduler do load balancing
better. It would be better still if the idle states could let the
scheduler know about the impact on the cache contents when
the cpu enters that state. The scheduler can make use of this data
while waking up tasks or scheduling new tasks. To make way for such
information to be propagated to the scheduler, enumerate idle states
in the scheduler topology levels.

Doing so will also let the scheduler know the idle states
that a *sched_group* can enter into at a given level of the scheduling
domain. This means the scheduler is implicitly made aware of the
fact that an idle state is not necessarily a per-cpu state; it can be a
per-core state or a state shared by a group of cpus specified
by the sched_group. The knowledge of this higher level cpuidle information
is missing today too.

The low level platform cpuidle drivers must expose to the scheduler
the idle states at the different topology levels. This patch takes
up the powernv cpuidle driver to illustrate this. The scheduling
topology itself is left to the arch to decide;
commit 143e1e28cb40bed836 introduced this. The platform idle
drivers are thus in a better position to fill in the topology
levels with appropriate cpuidle state information while they discover
it themselves.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 drivers/cpuidle/cpuidle-powernv.c |    8 ++++++++
 include/linux/sched.h             |    3 +++
 2 files changed, 11 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 95ef533..4232fbc 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -184,6 +184,11 @@ static int powernv_add_idle_states(void)
 
 	dt_idle_states = len_flags / sizeof(u32);
 
+#ifdef CONFIG_SCHED_POWER
+	/* Snooze is a thread level idle state; the rest are core level idle states */
+	sched_domain_topology[0].states[0] = powernv_states[0];
+#endif
+
 	for (i = 0; i < dt_idle_states; i++) {
 
 		flags = be32_to_cpu(idle_state_flags[i]);
@@ -209,6 +214,9 @@ static int powernv_add_idle_states(void)
 			powernv_states[nr_idle_states].enter = &fastsleep_loop;
 			nr_idle_states++;
 		}
+#ifdef CONFIG_SCHED_POWER
+		sched_domain_topology[1].states[i] = powernv_states[nr_idle_states];
+#endif
 	}
 
 	return nr_idle_states;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5dd99b5..009da6a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1027,6 +1027,9 @@ struct sched_domain_topology_level {
 #ifdef CONFIG_SCHED_DEBUG
 	char                *name;
 #endif
+#ifdef CONFIG_SCHED_POWER
+	struct cpuidle_state states[CPUIDLE_STATE_MAX];
+#endif
 };
 
 extern struct sched_domain_topology_level *sched_domain_topology;



* [RFC PATCH V2 04/19] sched: add sched balance policies in kernel
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
                   ` (2 preceding siblings ...)
  2014-08-11 11:33 ` [RFC PATCH V2 03/19] sched/idle: Enumerate idle states in scheduler topology Preeti U Murthy
@ 2014-08-11 11:34 ` Preeti U Murthy
  2014-08-11 11:34 ` [RFC PATCH V2 05/19] sched: add sysfs interface for sched_balance_policy selection Preeti U Murthy
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:34 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

From: Alex Shi <alex.shi@intel.com>

The current scheduler behavior only considers the overall performance of the
system, so it tries to spread tasks onto more cpu sockets and cpu cores.

To add the consideration of power awareness, this patchset adds
a powersaving scheduler policy. It will use the runnable load util in
scheduler balancing. The current scheduling is taken as the performance
policy.

performance: the current scheduling behaviour; try to spread tasks
                onto more CPU sockets or cores. Performance oriented.
powersaving: pack tasks into few sched groups until every LCPU in the
                group is full. Power oriented.

The incoming patches will enable powersaving scheduling in CFS.
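
A hedged sketch of how the later patches in the series are expected to
consult this knob on balancing paths (the policy constants are the ones
added below; the helper name here is made up for illustration only):

  #ifdef CONFIG_SCHED_POWER
  static inline bool sched_policy_powersave(void)
  {
          /* balance paths can branch on the currently selected policy */
          return sched_balance_policy == SCHED_POLICY_POWERSAVING;
  }
  #endif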

Signed-off-by: Alex Shi <alex.shi@intel.com>
[Added CONFIG_SCHED_POWER switch to enable this patch]
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 kernel/sched/fair.c  |    5 +++++
 kernel/sched/sched.h |    7 +++++++
 2 files changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfa3c86..77da534 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7800,6 +7800,11 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 	return rr_interval;
 }
 
+#ifdef CONFIG_SCHED_POWER
+/* The default scheduler policy is 'performance'. */
+int __read_mostly sched_balance_policy = SCHED_POLICY_PERFORMANCE;
+#endif
+
 /*
  * All the scheduling class methods:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 579712f..95fc013 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -23,6 +23,13 @@ extern atomic_long_t calc_load_tasks;
 extern long calc_load_fold_active(struct rq *this_rq);
 extern void update_cpu_load_active(struct rq *this_rq);
 
+#ifdef CONFIG_SCHED_POWER
+#define SCHED_POLICY_PERFORMANCE	(0x1)
+#define SCHED_POLICY_POWERSAVING	(0x2)
+
+extern int __read_mostly sched_balance_policy;
+#endif
+
 /*
  * Helpers for converting nanosecond timing to jiffy resolution
  */



* [RFC PATCH V2 05/19] sched: add sysfs interface for sched_balance_policy selection
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
                   ` (3 preceding siblings ...)
  2014-08-11 11:34 ` [RFC PATCH V2 04/19] sched: add sched balance policies in kernel Preeti U Murthy
@ 2014-08-11 11:34 ` Preeti U Murthy
  2014-08-11 11:35 ` [RFC PATCH V2 06/19] sched: log the cpu utilization at rq Preeti U Murthy
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:34 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

From: Alex Shi <alex.shi@intel.com>

This patch adds the power aware scheduler knob to sysfs:

$cat /sys/devices/system/cpu/sched_balance_policy/available_sched_balance_policy
performance powersaving
$cat /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy
powersaving

This means the sched balance policy currently in use is 'powersaving'.

Users can change the policy with the 'echo' command:
 echo performance > /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy

Signed-off-by: Alex Shi <alex.shi@intel.com>
[Added CONFIG_SCHED_POWER switch to enable this patch]
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 Documentation/ABI/testing/sysfs-devices-system-cpu |   23 +++++++
 kernel/sched/fair.c                                |   69 ++++++++++++++++++++
 2 files changed, 92 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index acb9bfc..c7ed5c1 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -53,6 +53,29 @@ Description:	Dynamic addition and removal of CPU's.  This is not hotplug
 		the system.  Information writtento the file to remove CPU's
 		is architecture specific.
 
+What:		/sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy
+		/sys/devices/system/cpu/sched_balance_policy/available_sched_balance_policy
+Date:		Oct 2012
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:	CFS balance policy show and set interface.
+		This will get activated only when CONFIG_SCHED_POWER is set.
+
+		available_sched_balance_policy: shows there are 2 kinds of
+		policies:
+			performance powersaving.
+		current_sched_balance_policy: shows current scheduler policy.
+		User can change the policy by writing it.
+
+		Policy decides the CFS scheduler how to balance tasks onto
+		different CPU unit.
+
+		performance: try to spread tasks onto more CPU sockets,
+		more CPU cores. performance oriented.
+
+		powersaving: try to pack tasks onto same core or same CPU
+		until every LCPUs are busy in the core or CPU socket.
+		powersaving oriented.
+
 What:		/sys/devices/system/cpu/cpu#/node
 Date:		October 2009
 Contact:	Linux memory management mailing list <linux-mm@kvack.org>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77da534..f1b0a33 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5863,6 +5863,75 @@ static inline int sg_imbalanced(struct sched_group *group)
 	return group->sgc->imbalance;
 }
 
+#if defined(CONFIG_SYSFS) && defined(CONFIG_SCHED_POWER)
+static ssize_t show_available_sched_balance_policy(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "performance powersaving\n");
+}
+
+static ssize_t show_current_sched_balance_policy(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE)
+		return sprintf(buf, "performance\n");
+	else if (sched_balance_policy == SCHED_POLICY_POWERSAVING)
+		return sprintf(buf, "powersaving\n");
+	return 0;
+}
+
+static ssize_t set_sched_balance_policy(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	unsigned int ret = -EINVAL;
+	char    str_policy[16];
+
+	ret = sscanf(buf, "%15s", str_policy);
+	if (ret != 1)
+		return -EINVAL;
+
+	if (!strcmp(str_policy, "performance"))
+		sched_balance_policy = SCHED_POLICY_PERFORMANCE;
+	else if (!strcmp(str_policy, "powersaving"))
+		sched_balance_policy = SCHED_POLICY_POWERSAVING;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+/*
+ *  * Sysfs setup bits:
+ *   */
+static DEVICE_ATTR(current_sched_balance_policy, 0644,
+		show_current_sched_balance_policy, set_sched_balance_policy);
+
+static DEVICE_ATTR(available_sched_balance_policy, 0444,
+		show_available_sched_balance_policy, NULL);
+
+static struct attribute *sched_balance_policy_default_attrs[] = {
+	&dev_attr_current_sched_balance_policy.attr,
+	&dev_attr_available_sched_balance_policy.attr,
+	NULL
+};
+static struct attribute_group sched_balance_policy_attr_group = {
+	.attrs = sched_balance_policy_default_attrs,
+	.name = "sched_balance_policy",
+};
+
+int __init create_sysfs_sched_balance_policy_group(struct device *dev)
+{
+	return sysfs_create_group(&dev->kobj, &sched_balance_policy_attr_group);
+}
+
+static int __init sched_balance_policy_sysfs_init(void)
+{
+	return create_sysfs_sched_balance_policy_group(cpu_subsys.dev_root);
+}
+
+core_initcall(sched_balance_policy_sysfs_init);
+#endif /* CONFIG_SYSFS && CONFIG_SCHED_POWER*/
+
 /*
  * Compute the group capacity factor.
  *



* [RFC PATCH V2 06/19] sched: log the cpu utilization at rq
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
                   ` (4 preceding siblings ...)
  2014-08-11 11:34 ` [RFC PATCH V2 05/19] sched: add sysfs interface for sched_balance_policy selection Preeti U Murthy
@ 2014-08-11 11:35 ` Preeti U Murthy
  2014-08-11 11:35 ` [RFC PATCH V2 07/19] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing Preeti U Murthy
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:35 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

From: Alex Shi <alex.shi@intel.com>

The cpu's utilization is a measure of how busy the cpu is:
        util = cpu_rq(cpu)->avg.runnable_avg_sum * SCHED_POWER_SCALE
                / cpu_rq(cpu)->avg.runnable_avg_period;

Since the util is no more than 1, we scale its value by 1024, the same as
SCHED_POWER_SCALE, and set FULL_UTIL to 1024.

In the later power aware scheduling we are sensitive to how busy the cpu
is, since power consumption is tightly related to cpu busy time.

BTW, rq->util can be used for any purpose if needed, not only power
scheduling.
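
A minimal userspace sketch of the same arithmetic with assumed sample values
(not part of the patch), just to show the scaling. A cpu that was runnable
for half of the observed period ends up at util = 512 out of FULL_UTIL (1024):

  #include <stdio.h>

  #define POWER_SHIFT 10  /* local stand-in for the kernel's scale shift */

  int main(void)
  {
          unsigned long runnable_avg_sum = 512;     /* assumed sample */
          unsigned long runnable_avg_period = 1024; /* assumed sample */
          unsigned int util;

          util = (runnable_avg_sum << POWER_SHIFT) / runnable_avg_period;
          printf("util = %u / 1024\n", util);  /* prints: util = 512 / 1024 */
          return 0;
  }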

Signed-off-by: Alex Shi <alex.shi@intel.com>
[Added CONFIG_SCHED_POWER switch to enable this patch]
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 kernel/sched/debug.c |    3 +++
 kernel/sched/fair.c  |   15 +++++++++++++++
 kernel/sched/sched.h |    9 +++++++++
 3 files changed, 27 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 627b3c3..395af7f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -325,6 +325,9 @@ do {									\
 
 	P(ttwu_count);
 	P(ttwu_local);
+#ifdef CONFIG_SCHED_POWER
+	P(util);
+#endif
 
 #undef P
 #undef P64
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f1b0a33..681ad06 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2446,10 +2446,25 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 	}
 }
 
+#ifdef CONFIG_SCHED_POWER
+static void update_rq_util(struct rq *rq)
+{
+	if (rq->avg.runnable_avg_period)
+		rq->util = (u64)(rq->avg.runnable_avg_sum << SCHED_CAPACITY_SHIFT)
+				/ rq->avg.runnable_avg_period;
+	else
+		rq->util = (u64)(rq->avg.runnable_avg_sum << SCHED_CAPACITY_SHIFT);
+}
+#else
+static void update_rq_util(struct rq *rq) {}
+#endif
+
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
 	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+
+	update_rq_util(rq);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 95fc013..971b812 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -508,6 +508,11 @@ extern struct root_domain def_root_domain;
 
 #endif /* CONFIG_SMP */
 
+/* full cpu utilization */
+#ifdef CONFIG_SCHED_POWER
+#define FULL_UTIL	SCHED_CAPACITY_SCALE
+#endif
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -556,6 +561,10 @@ struct rq {
 	struct sched_avg avg;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_POWER
+	unsigned int util;
+#endif /* CONFIG_SCHED_POWER */
+
 	/*
 	 * This is part of a global counter where only the total sum
 	 * over all CPUs matters. A task can increase this counter on



* [RFC PATCH V2 07/19] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
                   ` (5 preceding siblings ...)
  2014-08-11 11:35 ` [RFC PATCH V2 06/19] sched: log the cpu utilization at rq Preeti U Murthy
@ 2014-08-11 11:35 ` Preeti U Murthy
  2014-08-11 11:36 ` [RFC PATCH V2 08/19] sched: move sg/sd_lb_stats struct ahead Preeti U Murthy
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:35 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

From: Alex Shi <alex.shi@intel.com>

For power aware balancing, we care about the sched domain/group's
utilization. So add: sd_lb_stats.sd_util and sg_lb_stats.group_util.

We also want to know which group is busiest but still has the capability to
handle more tasks, so add: sd_lb_stats.group_leader.
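
This patch only declares the fields; the subsequent patches in Alex's series
fill and use them. A hedged sketch of the kind of accumulation one would
expect (the field names come from the hunk below; the function itself is
illustrative and not part of this patch):

  /* illustrative only: how the new fields could be filled during
   * update_sg_lb_stats()/update_sd_lb_stats(); not part of this patch */
  static void accumulate_util(struct sched_group *group,
                              struct sg_lb_stats *sgs, struct sd_lb_stats *sds)
  {
          int i;

          for_each_cpu(i, sched_group_cpus(group))
                  sgs->group_util += cpu_rq(i)->util; /* rq->util from patch 06 */

          sds->sd_util += sgs->group_util;
  }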

Signed-off-by: Alex Shi <alex.shi@intel.com>
[Added CONFIG_SCHED_POWER switch to enable this patch]
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 kernel/sched/fair.c |    9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 681ad06..3d6d081 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5593,6 +5593,9 @@ struct sg_lb_stats {
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
 #endif
+#ifdef CONFIG_SCHED_POWER
+	unsigned int group_util;	/* sum utilization of group */
+#endif
 };
 
 /*
@@ -5608,6 +5611,12 @@ struct sd_lb_stats {
 
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
 	struct sg_lb_stats local_stat;	/* Statistics of the local group */
+
+#ifdef CONFIG_SCHED_POWER
+        /* Variables of power aware scheduling */
+        unsigned int  sd_util;  /* sum utilization of this domain */
+        struct sched_group *group_leader; /* Group which relieves group_min */
+#endif
 };
 
 static inline void init_sd_lb_stats(struct sd_lb_stats *sds)



* [RFC PATCH V2 08/19] sched: move sg/sd_lb_stats struct ahead
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
                   ` (6 preceding siblings ...)
  2014-08-11 11:35 ` [RFC PATCH V2 07/19] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing Preeti U Murthy
@ 2014-08-11 11:36 ` Preeti U Murthy
  2014-08-11 11:36 ` [RFC PATCH V2 09/19] sched: get rq potential maximum utilization Preeti U Murthy
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:36 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

From: Alex Shi <alex.shi@intel.com>

Power aware fork/exec/wake balancing in the incoming patches needs both
of these structs, so move their definitions ahead of
select_task_rq_fair().

Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 kernel/sched/fair.c |   89 ++++++++++++++++++++++++++-------------------------
 1 file changed, 45 insertions(+), 44 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d6d081..031d115 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4505,6 +4505,51 @@ done:
 }
 
 /*
+ * sg_lb_stats - stats of a sched_group required for load_balancing
+ */
+struct sg_lb_stats {
+	unsigned long avg_load; /*Avg load across the CPUs of the group */
+	unsigned long group_load; /* Total load over the CPUs of the group */
+	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+	unsigned long load_per_task;
+	unsigned long group_capacity;
+	unsigned int sum_nr_running; /* Nr tasks running in the group */
+	unsigned int group_capacity_factor;
+	unsigned int idle_cpus;
+	unsigned int group_weight;
+	int group_imb; /* Is there an imbalance in the group ? */
+	int group_has_free_capacity;
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned int nr_numa_running;
+	unsigned int nr_preferred_running;
+#endif
+#ifdef CONFIG_SCHED_POWER
+	unsigned int group_util;	/* sum utilization of group */
+#endif
+};
+
+/*
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ *		 during load balancing.
+ */
+struct sd_lb_stats {
+	struct sched_group *busiest;	/* Busiest group in this sd */
+	struct sched_group *local;	/* Local group in this sd */
+	unsigned long total_load;	/* Total load of all groups in sd */
+	unsigned long total_capacity;	/* Total capacity of all groups in sd */
+	unsigned long avg_load;	/* Average load across all groups in sd */
+
+	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
+	struct sg_lb_stats local_stat;	/* Statistics of the local group */
+
+#ifdef CONFIG_SCHED_POWER
+        /* Variables of power aware scheduling */
+        unsigned int  sd_util;  /* sum utilization of this domain */
+        struct sched_group *group_leader; /* Group which relieves group_min */
+#endif
+};
+
+/*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
  * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
@@ -5574,50 +5619,6 @@ static unsigned long task_h_load(struct task_struct *p)
 #endif
 
 /********** Helpers for find_busiest_group ************************/
-/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
- */
-struct sg_lb_stats {
-	unsigned long avg_load; /*Avg load across the CPUs of the group */
-	unsigned long group_load; /* Total load over the CPUs of the group */
-	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
-	unsigned long load_per_task;
-	unsigned long group_capacity;
-	unsigned int sum_nr_running; /* Nr tasks running in the group */
-	unsigned int group_capacity_factor;
-	unsigned int idle_cpus;
-	unsigned int group_weight;
-	int group_imb; /* Is there an imbalance in the group ? */
-	int group_has_free_capacity;
-#ifdef CONFIG_NUMA_BALANCING
-	unsigned int nr_numa_running;
-	unsigned int nr_preferred_running;
-#endif
-#ifdef CONFIG_SCHED_POWER
-	unsigned int group_util;	/* sum utilization of group */
-#endif
-};
-
-/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- *		 during load balancing.
- */
-struct sd_lb_stats {
-	struct sched_group *busiest;	/* Busiest group in this sd */
-	struct sched_group *local;	/* Local group in this sd */
-	unsigned long total_load;	/* Total load of all groups in sd */
-	unsigned long total_capacity;	/* Total capacity of all groups in sd */
-	unsigned long avg_load;	/* Average load across all groups in sd */
-
-	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
-	struct sg_lb_stats local_stat;	/* Statistics of the local group */
-
-#ifdef CONFIG_SCHED_POWER
-        /* Variables of power aware scheduling */
-        unsigned int  sd_util;  /* sum utilization of this domain */
-        struct sched_group *group_leader; /* Group which relieves group_min */
-#endif
-};
 
 static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 {


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH V2 09/19] sched: get rq potential maximum utilization
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
                   ` (7 preceding siblings ...)
  2014-08-11 11:36 ` [RFC PATCH V2 08/19] sched: move sg/sd_lb_stats struct ahead Preeti U Murthy
@ 2014-08-11 11:36 ` Preeti U Murthy
  2014-08-11 11:37 ` [RFC PATCH V2 10/19] sched: detect wakeup burst with rq->avg_idle Preeti U Murthy
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:36 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

From: Alex Shi <alex.shi@intel.com>

Since rt tasks have a higher priority than fair tasks, the utilization
available to cfs is whatever is left after the rt utilization.

When there are cfs tasks on the queue, each of them could potentially
grow into that leftover headroom, so the cfs utilization (capped at the
headroom) is multiplied by the number of cfs tasks to get the maximum
potential cfs utilization. The potential maximum rq utilization is then
the sum of the rt utilization and that cfs utilization.
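
As a worked example (illustrative numbers, with SCHED_CAPACITY_SCALE and
hence FULL_UTIL at 1024): if scale_rt_util() reports an rt utilization
of 256 on a CPU, rq->util is 300 and two cfs tasks are queued, then
cfs_util = min(1024 - 256, 300) = 300 and max_rq_util() returns
256 + 300 * 2 = 856.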

Signed-off-by: Alex Shi <alex.shi@intel.com>
[Added CONFIG_SCHED_POWER switch to enable this patch]
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 kernel/sched/fair.c |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 031d115..60abaf4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3308,6 +3308,53 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 	return cfs_bandwidth_used() && cfs_rq->throttle_count;
 }
 
+#ifdef CONFIG_SCHED_POWER
+static unsigned long scale_rt_util(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u64 total, age_stamp, avg;
+	s64 delta;
+
+	/*
+	 * Since we're reading these variables without serialization make sure
+	 * we read them once before doing sanity checks on them.
+	 */
+	age_stamp = ACCESS_ONCE(rq->age_stamp);
+	avg = ACCESS_ONCE(rq->rt_avg);
+
+	delta = rq_clock(rq) - age_stamp;
+	if (unlikely(delta < 0))
+		delta = 0;
+
+	total = sched_avg_period() + delta;
+
+	if (unlikely(total < avg)) {
+		/* Ensures that capacity won't go beyond full scaled value */
+		return SCHED_CAPACITY_SCALE;
+	}
+
+	if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
+		total = SCHED_CAPACITY_SCALE;
+
+	total >>= SCHED_CAPACITY_SHIFT;
+
+	return div_u64(avg, total);
+}
+
+static unsigned int max_rq_util(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	unsigned int rt_util = scale_rt_util(cpu);
+	unsigned int cfs_util;
+	unsigned int nr_running;
+
+	cfs_util = (FULL_UTIL - rt_util) > rq->util ? rq->util
+			: (FULL_UTIL - rt_util);
+	nr_running = rq->nr_running ? rq->nr_running : 1;
+
+	return rt_util + cfs_util * nr_running;
+}
+#endif
 /*
  * Ensure that neither of the group entities corresponding to src_cpu or
  * dest_cpu are members of a throttled hierarchy when performing group


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH V2 10/19] sched: detect wakeup burst with rq->avg_idle
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
                   ` (8 preceding siblings ...)
  2014-08-11 11:36 ` [RFC PATCH V2 09/19] sched: get rq potential maximum utilization Preeti U Murthy
@ 2014-08-11 11:37 ` Preeti U Murthy
  2014-08-11 11:38 ` [RFC PATCH V2 11/19] sched: add power aware scheduling in fork/exec/wake Preeti U Murthy
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:37 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

From: Alex Shi <alex.shi@intel.com>

rq->avg_idle is 'used to accommodate bursty loads in a dirt simple
dirt cheap manner' -- Mike Galbraith.

With this cheap and smart burst indicator, we can detect a wakeup burst
and, in that case, use nr_running as the instantaneous utilization
instead.

Add 'sysctl_sched_burst_threshold' for wakeup burst detection and set it
to double sysctl_sched_migration_cost.
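
With the defaults in this patch, sysctl_sched_migration_cost is
500000 ns, so sched_burst_threshold_ns defaults to 1000000 ns (1 ms) and
can be tuned at run time via /proc/sys/kernel/sched_burst_threshold_ns.
A wakeup on a CPU whose rq->avg_idle is below this threshold is treated
as part of a burst by the following patches.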

Signed-off-by: Alex Shi <alex.shi@intel.com>
[Added CONFIG_SCHED_POWER switch to enable this patch]
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 include/linux/sched/sysctl.h |    3 +++
 kernel/sched/fair.c          |    4 ++++
 kernel/sysctl.c              |    9 +++++++++
 3 files changed, 16 insertions(+)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 596a0e0..0a3307b 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -55,6 +55,9 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
+#ifdef CONFIG_SCHED_POWER
+extern unsigned int sysctl_sched_burst_threshold;
+#endif /* CONFIG_SCHED_POWER */
 extern unsigned int sysctl_sched_nr_migrate;
 extern unsigned int sysctl_sched_time_avg;
 extern unsigned int sysctl_timer_migration;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 60abaf4..20e2414 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -92,6 +92,10 @@ unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
 
+#ifdef CONFIG_SCHED_POWER
+const_debug unsigned int sysctl_sched_burst_threshold = 1000000UL;
+#endif
+
 /*
  * The exponential sliding  window over which load is averaged for shares
  * distribution.
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 75875a7..8175d93 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -329,6 +329,15 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_SCHED_POWER
+	{
+		.procname	= "sched_burst_threshold_ns",
+		.data		= &sysctl_sched_burst_threshold,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif /* CONFIG_SCHED_POWER */
 	{
 		.procname	= "sched_nr_migrate",
 		.data		= &sysctl_sched_nr_migrate,


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH V2 11/19] sched: add power aware scheduling in fork/exec/wake
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
                   ` (9 preceding siblings ...)
  2014-08-11 11:37 ` [RFC PATCH V2 10/19] sched: detect wakeup burst with rq->avg_idle Preeti U Murthy
@ 2014-08-11 11:38 ` Preeti U Murthy
  2014-08-11 11:38 ` [RFC PATCH V2 12/19] sched: using avg_idle to detect bursty wakeup Preeti U Murthy
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:38 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

From: Alex Shi <alex.shi@intel.com>

This patch adds power aware scheduling in the fork/exec/wake paths. It
tries to select a cpu from the busiest group that still has spare
utilization. That saves power, since it leaves more groups completely
idle in the system.

The trade-off is the extra power aware statistics collection during
group selection. But since that collection only happens when the domain
is eligible for power scheduling, the worst case in hackbench testing
drops only about 2% with the powersaving policy, with no clear change
for the performance policy.

The main function in this patch is get_cpu_for_power_policy(), which
tries to pick the idlest cpu from the busiest group that still has
spare utilization, provided the system is using a power aware policy
and such a group exists.
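
As an illustration (made-up numbers, with FULL_UTIL at 1024): in a
domain with two groups of four CPUs each, whose summed max_rq_util() is
3000 and 500 respectively, both groups have a positive gap to
group_weight * FULL_UTIL = 4096, and the busier group (gap 1096 versus
3596) becomes sds->group_leader. The domain only counts as full once
sd_util plus the waking task's utilization reaches
span_weight * FULL_UTIL = 8192.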

Signed-off-by: Alex Shi <alex.shi@intel.com>
[Added CONFIG_SCHED_POWER switch to enable this patch]
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 kernel/sched/fair.c |  117 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 114 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 20e2414..e993f1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4600,6 +4600,103 @@ struct sd_lb_stats {
 #endif
 };
 
+#ifdef CONFIG_SCHED_POWER
+/*
+ * Try to collect the task running number and capacity of the group.
+ */
+static void get_sg_power_stats(struct sched_group *group,
+	struct sched_domain *sd, struct sg_lb_stats *sgs)
+{
+	int i;
+
+	for_each_cpu(i, sched_group_cpus(group))
+		sgs->group_util += max_rq_util(i);
+
+	sgs->group_weight = group->group_weight;
+}
+
+/*
+ * Is this domain full of utilization with the task?
+ */
+static int is_sd_full(struct sched_domain *sd,
+		struct task_struct *p, struct sd_lb_stats *sds)
+{
+	struct sched_group *group;
+	struct sg_lb_stats sgs;
+	long sd_min_delta = LONG_MAX;
+	unsigned int putil;
+
+	if (p->se.load.weight == p->se.avg.load_avg_contrib)
+		/* p maybe a new forked task */
+		putil = FULL_UTIL;
+	else
+		putil = (u64)(p->se.avg.runnable_avg_sum << SCHED_CAPACITY_SHIFT)
+				/ (p->se.avg.runnable_avg_period + 1);
+
+	/* Try to collect the domain's utilization */
+	group = sd->groups;
+	do {
+		long g_delta;
+
+		memset(&sgs, 0, sizeof(sgs));
+		get_sg_power_stats(group, sd, &sgs);
+
+		g_delta = sgs.group_weight * FULL_UTIL - sgs.group_util;
+
+		if (g_delta > 0 && g_delta < sd_min_delta) {
+			sd_min_delta = g_delta;
+			sds->group_leader = group;
+		}
+
+		sds->sd_util += sgs.group_util;
+	} while  (group = group->next, group != sd->groups);
+
+	if (sds->sd_util + putil < sd->span_weight * FULL_UTIL)
+		return 0;
+
+	/* can not hold one more task in this domain */
+	return 1;
+}
+
+/*
+ * Execute power policy if this domain is not full.
+ */
+static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
+	int cpu, struct task_struct *p, struct sd_lb_stats *sds)
+{
+	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE)
+		return SCHED_POLICY_PERFORMANCE;
+
+	memset(sds, 0, sizeof(*sds));
+	if (is_sd_full(sd, p, sds))
+		return SCHED_POLICY_PERFORMANCE;
+	return sched_balance_policy;
+}
+
+/*
+ * If power policy is eligible for this domain, and it has task allowed cpu.
+ * we will select CPU from this domain.
+ */
+static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
+		struct task_struct *p, struct sd_lb_stats *sds)
+{
+	int policy;
+	int new_cpu = -1;
+
+	policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
+	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
+		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+
+	return new_cpu;
+}
+#else
+static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
+		struct task_struct *p, struct sd_lb_stats *sds)
+{
+	return -1;
+}
+#endif /* CONFIG_SCHED_POWER */
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -4608,6 +4705,9 @@ struct sd_lb_stats {
  * Balances load by selecting the idlest cpu in the idlest group, or under
  * certain conditions an idle sibling cpu if the domain has SD_WAKE_AFFINE set.
  *
+ * If CONFIG_SCHED_POWER is set and SCHED_POLICY_POWERSAVE is enabled, the power
+ * aware scheduler kicks in. It returns a cpu appropriate for power savings.
+ *
  * Returns the target cpu number.
  *
  * preempt must be disabled.
@@ -4620,6 +4720,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int new_cpu = cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
+	struct sd_lb_stats sds;
 
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
@@ -4645,12 +4746,22 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			break;
 		}
 
-		if (tmp->flags & sd_flag)
+		if (tmp->flags & sd_flag) {
 			sd = tmp;
+
+			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+			if (new_cpu != -1)
+				goto unlock;
+		}
 	}
+	if (affine_sd) {
+		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+		if (new_cpu != -1)
+			goto unlock;
 
-	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-		prev_cpu = cpu;
+ 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+ 			prev_cpu = cpu;
+	}
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		new_cpu = select_idle_sibling(p, prev_cpu);


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH V2 12/19] sched: using avg_idle to detect bursty wakeup
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
                   ` (10 preceding siblings ...)
  2014-08-11 11:38 ` [RFC PATCH V2 11/19] sched: add power aware scheduling in fork/exec/wake Preeti U Murthy
@ 2014-08-11 11:38 ` Preeti U Murthy
  2014-08-11 11:39 ` [RFC PATCH V2 13/19] sched: packing transitory tasks in wakeup power balancing Preeti U Murthy
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:38 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

From: Alex Shi <alex.shi@intel.com>

A sleeping task has no utilization, so when many tasks are woken up in
a burst, their zero utilization throws the scheduler out of balance, as
seen with the aim7 benchmark.

rq->avg_idle is 'used to accommodate bursty loads in a dirt simple
dirt cheap manner' -- Mike Galbraith.

With this cheap and smart burst indicator, we can detect the wakeup
burst and use nr_running as the instantaneous utilization in that
scenario.

For all other scenarios, we still use the precise CPU utilization to
judge whether a domain is eligible for power scheduling.

Thanks to Mike Galbraith for the idea!
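
As an illustration (made-up numbers): on a four-CPU group where two CPUs
are idle (max_rq_util() of roughly 0) and two CPUs have just had 3 and 2
tasks woken onto them, burst mode sums group_util as roughly
0 + 0 + 3 + 2 = 5, which already exceeds the group_weight of 4. That
group can therefore not become the group_leader, and once sd_util
reaches the domain's span_weight the whole domain is treated as full and
the performance policy is used instead.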

Signed-off-by: Alex Shi <alex.shi@intel.com>
[Added CONFIG_SCHED_POWER switch to enable this patch]
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 kernel/sched/fair.c |   33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e993f1c..3db77e8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4605,12 +4605,19 @@ struct sd_lb_stats {
  * Try to collect the task running number and capacity of the group.
  */
 static void get_sg_power_stats(struct sched_group *group,
-	struct sched_domain *sd, struct sg_lb_stats *sgs)
+	struct sched_domain *sd, struct sg_lb_stats *sgs, int burst)
 {
 	int i;
 
-	for_each_cpu(i, sched_group_cpus(group))
-		sgs->group_util += max_rq_util(i);
+	for_each_cpu(i, sched_group_cpus(group)) {
+		struct rq *rq = cpu_rq(i);
+
+		if (burst && rq->nr_running > 1)
+			/* use nr_running as instant utilization */
+			sgs->group_util += rq->nr_running;
+		else
+			sgs->group_util += max_rq_util(i);
+	}
 
 	sgs->group_weight = group->group_weight;
 }
@@ -4624,6 +4631,8 @@ static int is_sd_full(struct sched_domain *sd,
 	struct sched_group *group;
 	struct sg_lb_stats sgs;
 	long sd_min_delta = LONG_MAX;
+	int cpu = task_cpu(p);
+	int burst = 0;
 	unsigned int putil;
 
 	if (p->se.load.weight == p->se.avg.load_avg_contrib)
@@ -4633,15 +4642,21 @@ static int is_sd_full(struct sched_domain *sd,
 		putil = (u64)(p->se.avg.runnable_avg_sum << SCHED_CAPACITY_SHIFT)
 				/ (p->se.avg.runnable_avg_period + 1);
 
+	if (cpu_rq(cpu)->avg_idle < sysctl_sched_burst_threshold)
+		burst = 1;
+
 	/* Try to collect the domain's utilization */
 	group = sd->groups;
 	do {
 		long g_delta;
 
 		memset(&sgs, 0, sizeof(sgs));
-		get_sg_power_stats(group, sd, &sgs);
+		get_sg_power_stats(group, sd, &sgs, burst);
 
-		g_delta = sgs.group_weight * FULL_UTIL - sgs.group_util;
+		if (burst)
+			g_delta = sgs.group_weight - sgs.group_util;
+		else
+			g_delta = sgs.group_weight * FULL_UTIL - sgs.group_util;
 
 		if (g_delta > 0 && g_delta < sd_min_delta) {
 			sd_min_delta = g_delta;
@@ -4651,8 +4666,12 @@ static int is_sd_full(struct sched_domain *sd,
 		sds->sd_util += sgs.group_util;
 	} while  (group = group->next, group != sd->groups);
 
-	if (sds->sd_util + putil < sd->span_weight * FULL_UTIL)
-		return 0;
+	if (burst) {
+		if (sds->sd_util < sd->span_weight)
+			return 0;
+	} else
+		if (sds->sd_util + putil < sd->span_weight * FULL_UTIL)
+			return 0;
 
 	/* can not hold one more task in this domain */
 	return 1;


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH V2 13/19] sched: packing transitory tasks in wakeup power balancing
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
                   ` (11 preceding siblings ...)
  2014-08-11 11:38 ` [RFC PATCH V2 12/19] sched: using avg_idle to detect bursty wakeup Preeti U Murthy
@ 2014-08-11 11:39 ` Preeti U Murthy
  2014-08-11 11:39 ` [RFC PATCH V2 14/19] sched: add power/performance balance allow flag Preeti U Murthy
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:39 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

From: Alex Shi <alex.shi@intel.com>

If the woken task is transitory enough, it gets a chance to be packed
onto a cpu which is busy but still has spare time to take care of it.

With the powersaving policy, only tasks with a history utilization below
25% are candidates for packing. If no cpu is eligible to take the task,
fall back to the idlest cpu in the leader group.

Morten Rasmussen caught a type bug, and PeterZ reminded us to take
rt_util into account. Thanks to both!
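
As a worked example with FULL_UTIL at 1024: a task whose tracked
utilization putil is 200 (roughly 20%) contributes putil << 2 = 800, so
a cpu qualifies only if FULL_UTIL - max_rq_util(cpu) - 800 > 0, i.e. its
own utilization stays below 224. A task at or above 25% (putil >= 256)
can never produce a positive vacancy and therefore falls back to
find_idlest_cpu() in the leader group.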

Inspired-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Alex Shi <alex.shi@intel.com>
[Added CONFIG_SCHED_POWER switch to enable this patch]
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 kernel/sched/fair.c |   56 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 49 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3db77e8..e7a677e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4693,24 +4693,65 @@ static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
 }
 
 /*
+ * find_leader_cpu - find the busiest cpu that still has enough free time
+ * for the task, among the cpus in the group.
+ */
+static int
+find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
+		int policy)
+{
+	int vacancy, min_vacancy = INT_MAX;
+	int leader_cpu = -1;
+	int i;
+	/* percentage of the task's util */
+	unsigned putil = (u64)(p->se.avg.runnable_avg_sum << SCHED_CAPACITY_SHIFT)
+				/ (p->se.avg.runnable_avg_period + 1);
+
+	/* bias toward local cpu */
+	if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
+			FULL_UTIL - max_rq_util(this_cpu) - (putil << 2) > 0)
+		return this_cpu;
+
+	/* Traverse only the allowed CPUs */
+	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+		if (i == this_cpu)
+			continue;
+
+		/* only light task allowed, putil < 25% */
+		vacancy = FULL_UTIL - max_rq_util(i) - (putil << 2);
+
+		if (vacancy > 0 && vacancy < min_vacancy) {
+			min_vacancy = vacancy;
+			leader_cpu = i;
+		}
+	}
+	return leader_cpu;
+}
+
+/*
  * If power policy is eligible for this domain, and it has task allowed cpu.
  * we will select CPU from this domain.
  */
 static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
-		struct task_struct *p, struct sd_lb_stats *sds)
+		struct task_struct *p, struct sd_lb_stats *sds, int wakeup)
 {
 	int policy;
 	int new_cpu = -1;
 
 	policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
-	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
-		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
-
+	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
+		if (wakeup)
+			new_cpu = find_leader_cpu(sds->group_leader,
+							p, cpu, policy);
+		/* for fork balancing and a little busy task */
+		if (new_cpu == -1)
+			new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+	}
 	return new_cpu;
 }
 #else
 static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
-		struct task_struct *p, struct sd_lb_stats *sds)
+		struct task_struct *p, struct sd_lb_stats *sds, int wakeup)
 {
 	return -1;
 }
@@ -4768,13 +4809,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (tmp->flags & sd_flag) {
 			sd = tmp;
 
-			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+						sd_flag & SD_BALANCE_WAKE);
 			if (new_cpu != -1)
 				goto unlock;
 		}
 	}
 	if (affine_sd) {
-		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 1);
 		if (new_cpu != -1)
 			goto unlock;
 


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH V2 14/19] sched: add power/performance balance allow flag
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
                   ` (12 preceding siblings ...)
  2014-08-11 11:39 ` [RFC PATCH V2 13/19] sched: packing transitory tasks in wakeup power balancing Preeti U Murthy
@ 2014-08-11 11:39 ` Preeti U Murthy
  2014-08-11 11:40 ` [RFC PATCH V2 15/19] sched: pull all tasks from source grp and no balance for prefer_sibling Preeti U Murthy
  2014-08-11 11:41 ` [RFC PATCH V2 16/19] sched: add new members of sd_lb_stats Preeti U Murthy
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:39 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

From: Alex Shi <alex.shi@intel.com>

If a sched domain is idle enough for regular power balance, LBF_POWER_BAL
will be set and LBF_PERF_BAL cleared. If the sched domain is busy, the
flags are set the other way around.

If the domain is suitable for power balance, but the balancing should not
be done by this cpu (this cpu is already idle or full), both flags are
cleared so that a more suitable cpu can do the power balance.
That means no balancing at all, neither power balance nor performance
balance, will be done on this cpu.

The above logic will be implemented by incoming patches.
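
A minimal sketch of the kind of gating these flags enable (hypothetical
placement and naming; the real checks arrive in the incoming patches):

	/*
	 * Hypothetical sketch only: if neither power nor performance
	 * balance is allowed on this cpu, bail out and leave the work
	 * to a more suitable cpu.
	 */
	if (!(env.flags & (LBF_POWER_BAL | LBF_PERF_BAL)))
		goto out_balanced;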

Signed-off-by: Alex Shi <alex.shi@intel.com>
[Added CONFIG_SCHED_POWER switch to enable this patch]
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 kernel/sched/fair.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7a677e..f9b2a21 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5372,6 +5372,11 @@ enum fbq_type { regular, remote, all };
 #define LBF_DST_PINNED  0x04
 #define LBF_SOME_PINNED	0x08
 
+#ifdef CONFIG_SCHED_POWER
+#define LBF_POWER_BAL  0x10    /* if power balance allowed */
+#define LBF_PERF_BAL   0x20    /* if performance balance allowed */
+#endif
+
 struct lb_env {
 	struct sched_domain	*sd;
 
@@ -6866,6 +6871,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
+#ifdef CONFIG_SCHED_POWER
+		.flags		= LBF_PERF_BAL,
+#endif
 		.fbq_type	= all,
 	};
 


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH V2 15/19] sched: pull all tasks from source grp and no balance for prefer_sibling
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
                   ` (13 preceding siblings ...)
  2014-08-11 11:39 ` [RFC PATCH V2 14/19] sched: add power/performance balance allow flag Preeti U Murthy
@ 2014-08-11 11:40 ` Preeti U Murthy
  2014-08-11 11:41 ` [RFC PATCH V2 16/19] sched: add new members of sd_lb_stats Preeti U Murthy
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:40 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

From: Alex Shi <alex.shi@intel.com>

In power balance we want some sched groups to become completely empty so
that their CPUs can be powered down. So we are willing to move every task
off them, even the last one.

Also, in power aware scheduling we don't want to balance 'prefer_sibling'
groups just because the local group has capacity. If the local group ends
up with no tasks at all, that is exactly what power balance hopes for.
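
Concretely, with LBF_POWER_BAL set a busiest rq with nr_running == 1 is
still eligible for pulling (see is_busiest_eligible() below), and
find_busiest_queue() no longer skips a single-task rq just because its
load exceeds the imbalance, while completely idle rqs (nr_running == 0)
are always skipped.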

Signed-off-by: Alex Shi <alex.shi@intel.com>
[Added CONFIG_SCHED_POWER switch to enable this patch]
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 kernel/sched/fair.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f9b2a21..fd93eaf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6346,6 +6346,21 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_SCHED_POWER
+static int get_power_policy(struct lb_env *env)
+{
+	if (env->flags & LBF_PERF_BAL)
+		return 0;
+	else
+		return 1;
+}
+#else
+static int get_power_policy(struct lb_env *env)
+{
+	return 0;
+}
+#endif /* CONFIG_SCHED_POWER */
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -6358,6 +6373,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
 	bool overload = false;
+	int powersave = 0;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
@@ -6393,9 +6409,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		 * extra check prevents the case where you always pull from the
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
+		 *
+		 * In power aware scheduling, we don't care load weight and
+		 * want not to pull tasks just because local group has capacity.
 		 */
+		powersave = get_power_policy(env);
+
 		if (prefer_sibling && sds->local &&
-		    sds->local_stat.group_has_free_capacity)
+		    sds->local_stat.group_has_free_capacity && !powersave)
 			sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
@@ -6761,8 +6782,15 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu capacity.
 		 */
+#ifdef CONFIG_SCHED_POWER
+		if (rq->nr_running == 0 ||
+			(!(env->flags & LBF_POWER_BAL) && capacity_factor &&
+				rq->nr_running == 1 && wl > env->imbalance))
+ 			continue;
+#else
 		if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
 			continue;
+#endif /* CONFIG_SCHED_POWER */
 
 		/*
 		 * For the load comparisons with the other cpu's, consider
@@ -6848,6 +6876,25 @@ static int should_we_balance(struct lb_env *env)
 	return balance_cpu == env->dst_cpu;
 }
 
+#ifdef CONFIG_SCHED_POWER
+static int is_busiest_eligible(struct rq *rq, struct lb_env *env)
+{
+	if (rq->nr_running > 1 ||
+		(rq->nr_running == 1 && env->flags & LBF_POWER_BAL))
+			return 1;
+	else
+		return 0;
+}
+#else
+static int is_busiest_eligible(struct rq *rq, struct lb_env *env)
+{
+	if (rq->nr_running > 1)
+		return 1;
+	else
+		return 0;
+}
+#endif /* CONFIG_SCHED_POWER */
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -6911,7 +6958,7 @@ redo:
 	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
 
 	ld_moved = 0;
-	if (busiest->nr_running > 1) {
+	if (is_busiest_eligible(busiest, &env)) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
 		 * an imbalance but busiest->nr_running <= 1, the group is


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH V2 16/19] sched: add new members of sd_lb_stats
  2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
                   ` (14 preceding siblings ...)
  2014-08-11 11:40 ` [RFC PATCH V2 15/19] sched: pull all tasks from source grp and no balance for prefer_sibling Preeti U Murthy
@ 2014-08-11 11:41 ` Preeti U Murthy
  15 siblings, 0 replies; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-11 11:41 UTC (permalink / raw)
  To: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo
  Cc: nicolas.pitre, len.brown, yuyang.du, linaro-kernel,
	daniel.lezcano, corbet, catalin.marinas, markgross, sundar.iyer,
	linux-kernel, dietmar.eggemann, Lorenzo.Pieralisi,
	mike.turquette, akpm, paulmck, tglx

From: Alex Shi <alex.shi@intel.com>

Add 4 new members to sd_lb_stats that will be used in the incoming
power aware balance patches:

group_min;		// least utilized group in the domain
min_load_per_task;	// load_per_task in group_min
leader_util;		// sum of utilizations of group_leader
min_util;		// sum of utilizations of group_min

Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 kernel/sched/fair.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd93eaf..6d40aa3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4597,6 +4597,10 @@ struct sd_lb_stats {
         /* Variables of power aware scheduling */
         unsigned int  sd_util;  /* sum utilization of this domain */
         struct sched_group *group_leader; /* Group which relieves group_min */
+        struct sched_group *group_min;  /* Least loaded group in sd */
+        unsigned long min_load_per_task; /* load_per_task in group_min */
+        unsigned int  leader_util;      /* sum utilizations of group_leader */
+        unsigned int  min_util;         /* sum utilizations of group_min */
 #endif
 };
 


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH V2 01/19] sched/power: Remove cpu idle state selection and cpu frequency tuning
  2014-08-11 11:32 ` [RFC PATCH V2 01/19] sched/power: Remove cpu idle state selection and cpu frequency tuning Preeti U Murthy
@ 2014-08-18 15:39   ` Nicolas Pitre
  2014-08-18 17:26     ` Preeti U Murthy
  0 siblings, 1 reply; 23+ messages in thread
From: Nicolas Pitre @ 2014-08-18 15:39 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo, len.brown, yuyang.du,
	linaro-kernel, daniel.lezcano, corbet, catalin.marinas,
	markgross, sundar.iyer, linux-kernel, dietmar.eggemann,
	Lorenzo.Pieralisi, mike.turquette, akpm, paulmck, tglx

On Mon, 11 Aug 2014, Preeti U Murthy wrote:

> As a first step towards improving the power awareness of the scheduler,
> this patch enables a "dumb" state where all power management is turned off.
> Whatever additionally we put into the kernel for cpu power management must
> do better than this in terms of performance as well as powersavings.
> This will enable us to benchmark and optimize the power aware scheduler
> from scratch.If we are to benchmark it against the performance of the
> existing design, we will get sufficiently distracted by the performance
> numbers and get steered away from a sane design.

I understand your goal here, but people *will* compare performance 
between the old and the new design anyway.  So I think it would be a 
better approach to simply let the existing code be and create a new 
scheduler-based governor that can be swapped with the existing ones at 
run time.  Eventually we'll want average users to test and compare this, 
and asking them to recompile a second kernel and reboot between them 
might get unwieldy for many people.

And by allowing both to coexist at run time, we're making sure both the 
old and the new code keep getting built, which helps rather than breaks 
the old code.  And that will also cut down on the number of #ifdefs in 
many places.

In other words, CONFIG_SCHED_POWER should be needed to select the 
scheduler-based governor, but it shouldn't force the existing code to be 
disabled.


> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> ---
> 
>  arch/powerpc/Kconfig                   |    1 +
>  arch/powerpc/platforms/powernv/Kconfig |   12 ++++++------
>  drivers/cpufreq/Kconfig                |    2 ++
>  drivers/cpuidle/Kconfig                |    2 ++
>  kernel/Kconfig.sched                   |   11 +++++++++++
>  5 files changed, 22 insertions(+), 6 deletions(-)
>  create mode 100644 kernel/Kconfig.sched
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 80b94b0..b7fe36a 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -301,6 +301,7 @@ config HIGHMEM
>  
>  source kernel/Kconfig.hz
>  source kernel/Kconfig.preempt
> +source kernel/Kconfig.sched
>  source "fs/Kconfig.binfmt"
>  
>  config HUGETLB_PAGE_SIZE_VARIABLE
> diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
> index 45a8ed0..b0ef8b1 100644
> --- a/arch/powerpc/platforms/powernv/Kconfig
> +++ b/arch/powerpc/platforms/powernv/Kconfig
> @@ -11,12 +11,12 @@ config PPC_POWERNV
>  	select PPC_UDBG_16550
>  	select PPC_SCOM
>  	select ARCH_RANDOM
> -	select CPU_FREQ
> -	select CPU_FREQ_GOV_PERFORMANCE
> -	select CPU_FREQ_GOV_POWERSAVE
> -	select CPU_FREQ_GOV_USERSPACE
> -	select CPU_FREQ_GOV_ONDEMAND
> -	select CPU_FREQ_GOV_CONSERVATIVE
> +	select CPU_FREQ if !SCHED_POWER
> +	select CPU_FREQ_GOV_PERFORMANCE if CPU_FREQ
> +	select CPU_FREQ_GOV_POWERSAVE if CPU_FREQ
> +	select CPU_FREQ_GOV_USERSPACE if CPU_FREQ
> +	select CPU_FREQ_GOV_ONDEMAND if CPU_FREQ
> +	select CPU_FREQ_GOV_CONSERVATIVE if CPU_FREQ
>  	select PPC_DOORBELL
>  	default y
>  
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index ffe350f..8976fd6 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -2,6 +2,7 @@ menu "CPU Frequency scaling"
>  
>  config CPU_FREQ
>  	bool "CPU Frequency scaling"
> +	depends on !SCHED_POWER
>  	help
>  	  CPU Frequency scaling allows you to change the clock speed of 
>  	  CPUs on the fly. This is a nice method to save power, because 
> @@ -12,6 +13,7 @@ config CPU_FREQ
>  	  (see below) after boot, or use a userspace tool.
>  
>  	  For details, take a look at <file:Documentation/cpu-freq>.
> +	  This feature will turn off if power aware scheduling is enabled.
>  
>  	  If in doubt, say N.
>  
> diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
> index 32748c3..2c4ac79 100644
> --- a/drivers/cpuidle/Kconfig
> +++ b/drivers/cpuidle/Kconfig
> @@ -3,6 +3,7 @@ menu "CPU Idle"
>  config CPU_IDLE
>  	bool "CPU idle PM support"
>  	default y if ACPI || PPC_PSERIES
> +	depends on !SCHED_POWER
>  	select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
>  	select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE)
>  	help
> @@ -11,6 +12,7 @@ config CPU_IDLE
>  	  governors that can be swapped during runtime.
>  
>  	  If you're using an ACPI-enabled platform, you should say Y here.
> +	  This feature will turn off if power aware scheduling is enabled.
>  
>  if CPU_IDLE
>  
> diff --git a/kernel/Kconfig.sched b/kernel/Kconfig.sched
> new file mode 100644
> index 0000000..374454c
> --- /dev/null
> +++ b/kernel/Kconfig.sched
> @@ -0,0 +1,11 @@
> +menu "Power Aware Scheduling"
> +
> +config SCHED_POWER
> +        bool "Power Aware Scheduler"
> +        default n
> +        help
> +           Select this to enable the new power aware scheduler.
> +endmenu
> +
> +
> +
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH V2 02/19] sched/power: Move idle state selection into the scheduler
  2014-08-11 11:33 ` [RFC PATCH V2 02/19] sched/power: Move idle state selection into the scheduler Preeti U Murthy
@ 2014-08-18 15:54   ` Nicolas Pitre
  2014-08-18 17:19     ` Preeti U Murthy
  0 siblings, 1 reply; 23+ messages in thread
From: Nicolas Pitre @ 2014-08-18 15:54 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo, len.brown, yuyang.du,
	linaro-kernel, daniel.lezcano, corbet, catalin.marinas,
	markgross, sundar.iyer, linux-kernel, dietmar.eggemann,
	Lorenzo.Pieralisi, mike.turquette, akpm, paulmck, tglx

On Mon, 11 Aug 2014, Preeti U Murthy wrote:

> The goal of the power aware scheduling design is to integrate all
> policy, metrics and averaging into the scheduler. Today the
> cpu power management is fragmented and hence inconsistent.
> 
> As a first step towards this integration, rid the cpuidle state management
> of the governors. Retain only the cpuidle driver in the cpu idle
> susbsystem which acts as an interface between the scheduler and low
> level platform specific cpuidle drivers. For all decision making around
> selection of idle states,the cpuidle driver falls back to the scheduler.
> 
> The current algorithm for idle state selection is the same as the logic used
> by the menu governor. However going ahead the heuristics will be tuned and
> improved upon with metrics better known to the scheduler.

I'd strongly suggest a different approach here.  Instead of copying the 
menu governor code and tweaking it afterwards, it would be cleaner to 
literally start from scratch with a new governor.  Said new governor 
would grow inside the scheduler with more design freedom instead of 
being strapped on the side.

By copying existing code, the chance for cruft to remain for a long time 
is close to 100%. We already have one copy of it, let's keep it working 
and start afresh instead.

By starting clean it is way easier to explain and justify additions to a 
new design than convincing ourselves about the removal of no longer 
needed pieces from a legacy design.


> 
> Note: cpufrequency is still left disabled when CONFIG_SCHED_POWER is selected.
> 
> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> ---
> 
>  drivers/cpuidle/Kconfig           |   12 +
>  drivers/cpuidle/cpuidle-powernv.c |    2 
>  drivers/cpuidle/cpuidle.c         |   65 ++++-
>  include/linux/sched.h             |    9 +
>  kernel/sched/Makefile             |    1 
>  kernel/sched/power.c              |  480 +++++++++++++++++++++++++++++++++++++
>  6 files changed, 554 insertions(+), 15 deletions(-)
>  create mode 100644 kernel/sched/power.c
> 
> diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
> index 2c4ac79..4fa4cb1 100644
> --- a/drivers/cpuidle/Kconfig
> +++ b/drivers/cpuidle/Kconfig
> @@ -3,16 +3,14 @@ menu "CPU Idle"
>  config CPU_IDLE
>  	bool "CPU idle PM support"
>  	default y if ACPI || PPC_PSERIES
> -	depends on !SCHED_POWER
> -	select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
> -	select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE)
> +	select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE && !SCHED_POWER)
> +	select CPU_IDLE_GOV_MENU if ((NO_HZ || NO_HZ_IDLE) && !SCHED_POWER)
>  	help
>  	  CPU idle is a generic framework for supporting software-controlled
>  	  idle processor power management.  It includes modular cross-platform
>  	  governors that can be swapped during runtime.
>  
>  	  If you're using an ACPI-enabled platform, you should say Y here.
> -	  This feature will turn off if power aware scheduling is enabled.
>  
>  if CPU_IDLE
>  
> @@ -22,10 +20,16 @@ config CPU_IDLE_MULTIPLE_DRIVERS
>  config CPU_IDLE_GOV_LADDER
>  	bool "Ladder governor (for periodic timer tick)"
>  	default y
> +	depends on !SCHED_POWER
> +	help
> +	  This feature will turn off if power aware scheduling is enabled.
>  
>  config CPU_IDLE_GOV_MENU
>  	bool "Menu governor (for tickless system)"
>  	default y
> +	depends on !SCHED_POWER
> +	help
> +	  This feature will turn off if power aware scheduling is enabled.
>  
>  menu "ARM CPU Idle Drivers"
>  depends on ARM
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index fa79392..95ef533 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -70,7 +70,7 @@ static int fastsleep_loop(struct cpuidle_device *dev,
>  	unsigned long new_lpcr;
>  
>  	if (powersave_nap < 2)
> -		return;
> +		return 0;
>  	if (unlikely(system_state < SYSTEM_RUNNING))
>  		return index;
>  
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index ee9df5e..38fb213 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -150,6 +150,19 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>  	return entered_state;
>  }
>  
> +#ifdef CONFIG_SCHED_POWER
> +static int __cpuidle_select(struct cpuidle_driver *drv,
> +				struct cpuidle_device *dev)
> +{
> +	return cpuidle_sched_select(drv, dev);
> +}
> +#else
> +static int __cpuidle_select(struct cpuidle_driver *drv,
> +				struct cpuidle_device *dev)
> +{
> +	return cpuidle_curr_governor->select(drv, dev);	
> +}
> +#endif
>  /**
>   * cpuidle_select - ask the cpuidle framework to choose an idle state
>   *
> @@ -169,7 +182,7 @@ int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  	if (unlikely(use_deepest_state))
>  		return cpuidle_find_deepest_state(drv, dev);
>  
> -	return cpuidle_curr_governor->select(drv, dev);
> +	return __cpuidle_select(drv, dev);
>  }
>  
>  /**
> @@ -190,6 +203,18 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  	return cpuidle_enter_state(dev, drv, index);
>  }
>  
> +#ifdef CONFIG_SCHED_POWER
> +static void __cpuidle_reflect(struct cpuidle_device *dev, int index)
> +{
> +	cpuidle_sched_reflect(dev, index);
> +}
> +#else
> +static void __cpuidle_reflect(struct cpuidle_device *dev, int index)
> +{
> +	if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state))
> +		cpuidle_curr_governor->reflect(dev, index);
> +}
> +#endif
>  /**
>   * cpuidle_reflect - tell the underlying governor what was the state
>   * we were in
> @@ -200,8 +225,7 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>   */
>  void cpuidle_reflect(struct cpuidle_device *dev, int index)
>  {
> -	if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state))
> -		cpuidle_curr_governor->reflect(dev, index);
> +	__cpuidle_reflect(dev, index);
>  }
>  
>  /**
> @@ -265,6 +289,28 @@ void cpuidle_resume(void)
>  	mutex_unlock(&cpuidle_lock);
>  }
>  
> +#ifdef CONFIG_SCHED_POWER
> +static int cpuidle_check_governor(struct cpuidle_driver *drv,
> +					struct cpuidle_device *dev, int enable)
> +{
> +	if (enable)
> +		return cpuidle_sched_enable_device(drv, dev);
> +	else
> +		return 0;
> +}
> +#else
> +static int cpuidle_check_governor(struct cpuidle_driver *drv,
> +					struct cpuidle_device *dev, int enable)
> +{
> +	if (!cpuidle_curr_governor)
> +		return -EIO;
> +
> +	if (enable && cpuidle_curr_governor->enable)
> +		return cpuidle_curr_governor->enable(drv, dev);
> +	else if (cpuidle_curr_governor->disable)
> +		cpuidle_curr_governor->disable(drv, dev);
> +}
> +#endif
>  /**
>   * cpuidle_enable_device - enables idle PM for a CPU
>   * @dev: the CPU
> @@ -285,7 +331,7 @@ int cpuidle_enable_device(struct cpuidle_device *dev)
>  
>  	drv = cpuidle_get_cpu_driver(dev);
>  
> -	if (!drv || !cpuidle_curr_governor)
> +	if (!drv)
>  		return -EIO;
>  
>  	if (!dev->registered)
> @@ -298,8 +344,8 @@ int cpuidle_enable_device(struct cpuidle_device *dev)
>  	if (ret)
>  		return ret;
>  
> -	if (cpuidle_curr_governor->enable &&
> -	    (ret = cpuidle_curr_governor->enable(drv, dev)))
> +	ret = cpuidle_check_governor(drv, dev, 1);
> +	if (ret)
>  		goto fail_sysfs;
>  
>  	smp_wmb();
> @@ -331,13 +377,12 @@ void cpuidle_disable_device(struct cpuidle_device *dev)
>  	if (!dev || !dev->enabled)
>  		return;
>  
> -	if (!drv || !cpuidle_curr_governor)
> +	if (!drv)
>  		return;
> -
> +	
>  	dev->enabled = 0;
>  
> -	if (cpuidle_curr_governor->disable)
> -		cpuidle_curr_governor->disable(drv, dev);
> +	cpuidle_check_governor(drv, dev, 0);
>  
>  	cpuidle_remove_device_sysfs(dev);
>  	enabled_devices--;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 7c19d55..5dd99b5 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -26,6 +26,7 @@ struct sched_param {
>  #include <linux/nodemask.h>
>  #include <linux/mm_types.h>
>  #include <linux/preempt_mask.h>
> +#include <linux/cpuidle.h>
>  
>  #include <asm/page.h>
>  #include <asm/ptrace.h>
> @@ -846,6 +847,14 @@ enum cpu_idle_type {
>  	CPU_MAX_IDLE_TYPES
>  };
>  
> +#ifdef CONFIG_SCHED_POWER
> +extern void cpuidle_sched_reflect(struct cpuidle_device *dev, int index);
> +extern int cpuidle_sched_select(struct cpuidle_driver *drv,
> +					struct cpuidle_device *dev);
> +extern int cpuidle_sched_enable_device(struct cpuidle_driver *drv,
> +						struct cpuidle_device *dev);
> +#endif
> +
>  /*
>   * Increase resolution of cpu_capacity calculations
>   */
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index ab32b7b..5b8e469 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
>  obj-$(CONFIG_SCHEDSTATS) += stats.o
>  obj-$(CONFIG_SCHED_DEBUG) += debug.o
>  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
> +obj-$(CONFIG_SCHED_POWER) += power.o
> diff --git a/kernel/sched/power.c b/kernel/sched/power.c
> new file mode 100644
> index 0000000..63c9276
> --- /dev/null
> +++ b/kernel/sched/power.c
> @@ -0,0 +1,480 @@
> +/*
> + * power.c - the power aware scheduler
> + *
> + * Author:
> + *        Preeti U. Murthy <preeti@linux.vnet.ibm.com>
> + *
> + * This code is a replica of drivers/cpuidle/governors/menu.c
> + * To make the transition to power aware scheduler away from
> + * the cpuidle governor model easy, we do exactly what the
> + * governors do for now. Going ahead the heuristics will be
> + * tuned and improved upon.
> + *
> + * This code is licenced under the GPL version 2 as described
> + * in the COPYING file that accompanies the Linux Kernel.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/cpuidle.h>
> +#include <linux/pm_qos.h>
> +#include <linux/time.h>
> +#include <linux/ktime.h>
> +#include <linux/hrtimer.h>
> +#include <linux/tick.h>
> +#include <linux/sched.h>
> +#include <linux/math64.h>
> +#include <linux/module.h>
> +
> +/*
> + * Please note when changing the tuning values:
> + * If (MAX_INTERESTING-1) * RESOLUTION > UINT_MAX, the result of
> + * a scaling operation multiplication may overflow on 32 bit platforms.
> + * In that case, #define RESOLUTION as ULL to get 64 bit result:
> + * #define RESOLUTION 1024ULL
> + *
> + * The default values do not overflow.
> + */
> +#define BUCKETS 12
> +#define INTERVALS 8
> +#define RESOLUTION 1024
> +#define DECAY 8
> +#define MAX_INTERESTING 50000
> +
> +
> +/*
> + * Concepts and ideas behind the power aware scheduler
> + *
> + * For the power aware scheduler, there are 3 decision factors for picking a C
> + * state:
> + * 1) Energy break even point
> + * 2) Performance impact
> + * 3) Latency tolerance (from pmqos infrastructure)
> + * These three factors are treated independently.
> + *
> + * Energy break even point
> + * -----------------------
> + * C state entry and exit have an energy cost, and a certain amount of time in
> + * the  C state is required to actually break even on this cost. CPUIDLE
> + * provides us this duration in the "target_residency" field. So all that we
> + * need is a good prediction of how long we'll be idle. Like the traditional
> + * governors, we start with the actual known "next timer event" time.
> + *
> + * Since there are other source of wakeups (interrupts for example) than
> + * the next timer event, this estimation is rather optimistic. To get a
> + * more realistic estimate, a correction factor is applied to the estimate,
> + * that is based on historic behavior. For example, if in the past the actual
> + * duration always was 50% of the next timer tick, the correction factor will
> + * be 0.5.
> + *
> + * power aware scheduler uses a running average for this correction factor,
> + * however it uses a set of factors, not just a single factor. This stems from
> + * the realization that the ratio is dependent on the order of magnitude of the
> + * expected duration; if we expect 500 milliseconds of idle time the likelihood of
> + * getting an interrupt very early is much higher than if we expect 50 micro
> + * seconds of idle time. A second independent factor that has big impact on
> + * the actual factor is if there is (disk) IO outstanding or not.
> + * (as a special twist, we consider every sleep longer than 50 milliseconds
> + * as perfect; there are no power gains for sleeping longer than this)
> + *
> + * For these two reasons we keep an array of 12 independent factors, that gets
> + * indexed based on the magnitude of the expected duration as well as the
> + * "is IO outstanding" property.
> + *
> + * Repeatable-interval-detector
> + * ----------------------------
> + * There are some cases where "next timer" is a completely unusable predictor:
> + * Those cases where the interval is fixed, for example due to hardware
> + * interrupt mitigation, but also due to fixed transfer rate devices such as
> + * mice.
> + * For this, we use a different predictor: We track the duration of the last 8
> + * intervals and if the standard deviation of these 8 intervals is below a
> + * threshold value, we use the average of these intervals as prediction.
> + *
> + * Limiting Performance Impact
> + * ---------------------------
> + * C states, especially those with large exit latencies, can have a real
> + * noticeable impact on workloads, which is not acceptable for most sysadmins,
> + * and in addition, less performance has a power price of its own.
> + *
> + * As a general rule of thumb, power aware sched assumes that the following
> + * heuristic holds:
> + *     The busier the system, the less impact of C states is acceptable
> + *
> + * This rule-of-thumb is implemented using a performance-multiplier:
> + * If the exit latency times the performance multiplier is longer than
> + * the predicted duration, the C state is not considered a candidate
> + * for selection due to a too high performance impact. So the higher
> + * this multiplier is, the longer we need to be idle to pick a deep C
> + * state, and thus the less likely a busy CPU will hit such a deep
> + * C state.
> + *
> + * Two factors are used in determining this multiplier:
> + * a value of 10 is added for each point of "per cpu load average" we have.
> + * a value of 5 points is added for each process that is waiting for
> + * IO on this CPU.
> + * (these values are experimentally determined)
> + *
> + * The load average factor gives a longer term (few seconds) input to the
> + * decision, while the iowait value gives a cpu local instantaneous input.
> + * The iowait factor may look low, but realize that this is also already
> + * represented in the system load average.
> + *
> + */
> +
> +struct sched_cpuidle_info {
> +	int		last_state_idx;
> +	int             needs_update;
> +
> +	unsigned int	next_timer_us;
> +	unsigned int	predicted_us;
> +	unsigned int	bucket;
> +	unsigned int	correction_factor[BUCKETS];
> +	unsigned int	intervals[INTERVALS];
> +	int		interval_ptr;
> +};
> +
> +
> +#define LOAD_INT(x) ((x) >> FSHIFT)
> +#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
> +
> +static int get_loadavg(void)
> +{
> +	unsigned long this = this_cpu_load();
> +
> +	return LOAD_INT(this) * 10 + LOAD_FRAC(this) / 10;
> +}
> +
> +static inline int which_bucket(unsigned int duration)
> +{
> +	int bucket = 0;
> +
> +	/*
> +	 * We keep two groups of stats; one with
> +	 * IO pending, one without.
> +	 * This allows us to calculate
> +	 * E(duration)|iowait
> +	 */
> +	if (nr_iowait_cpu(smp_processor_id()))
> +		bucket = BUCKETS/2;
> +
> +	if (duration < 10)
> +		return bucket;
> +	if (duration < 100)
> +		return bucket + 1;
> +	if (duration < 1000)
> +		return bucket + 2;
> +	if (duration < 10000)
> +		return bucket + 3;
> +	if (duration < 100000)
> +		return bucket + 4;
> +	return bucket + 5;
> +}
> +
> +/*
> + * Return a multiplier for the exit latency that is intended
> + * to take performance requirements into account.
> + * The more performance critical we estimate the system
> + * to be, the higher this multiplier, and thus the higher
> + * the barrier to go to an expensive C state.
> + */
> +static inline int performance_multiplier(void)
> +{
> +	int mult = 1;
> +
> +	/* for higher loadavg, we are more reluctant */
> +
> +	mult += 2 * get_loadavg();
> +
> +	/* for IO wait tasks (per cpu!) we add 5x each */
> +	mult += 10 * nr_iowait_cpu(smp_processor_id());
> +
> +	return mult;
> +}
> +
> +static DEFINE_PER_CPU(struct sched_cpuidle_info, cpuidle_info);
> +
> +static void cpuidle_sched_update(struct cpuidle_driver *drv,
> +					struct cpuidle_device *dev);
> +
> +/* This implements DIV_ROUND_CLOSEST but avoids 64 bit division */
> +static u64 div_round64(u64 dividend, u32 divisor)
> +{
> +	return div_u64(dividend + (divisor / 2), divisor);
> +}
> +
> +/*
> + * Try detecting repeating patterns by keeping track of the last 8
> + * intervals, and checking if the standard deviation of that set
> + * of points is below a threshold. If it is... then use the
> + * average of these 8 points as the estimated value.
> + */
> +static void get_typical_interval(struct sched_cpuidle_info *data)
> +{
> +	int i, divisor;
> +	unsigned int max, thresh;
> +	uint64_t avg, stddev;
> +
> +	thresh = UINT_MAX; /* Discard outliers above this value */
> +
> +again:
> +
> +	/* First calculate the average of past intervals */
> +	max = 0;
> +	avg = 0;
> +	divisor = 0;
> +	for (i = 0; i < INTERVALS; i++) {
> +		unsigned int value = data->intervals[i];
> +		if (value <= thresh) {
> +			avg += value;
> +			divisor++;
> +			if (value > max)
> +				max = value;
> +		}
> +	}
> +	do_div(avg, divisor);
> +
> +	/* Then try to determine standard deviation */
> +	stddev = 0;
> +	for (i = 0; i < INTERVALS; i++) {
> +		unsigned int value = data->intervals[i];
> +		if (value <= thresh) {
> +			int64_t diff = value - avg;
> +			stddev += diff * diff;
> +		}
> +	}
> +	do_div(stddev, divisor);
> +	/*
> +	 * The typical interval is obtained when standard deviation is small
> +	 * or standard deviation is small compared to the average interval.
> +	 *
> +	 * int_sqrt() formal parameter type is unsigned long. When the
> +	 * greatest difference to an outlier exceeds ~65 ms * sqrt(divisor)
> +	 * the resulting squared standard deviation exceeds the input domain
> +	 * of int_sqrt on platforms where unsigned long is 32 bits in size.
> +	 * In such case reject the candidate average.
> +	 *
> +	 * Use this result only if there is no timer to wake us up sooner.
> +	 */
> +	if (likely(stddev <= ULONG_MAX)) {
> +		stddev = int_sqrt(stddev);
> +		if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
> +							|| stddev <= 20) {
> +			if (data->next_timer_us > avg)
> +				data->predicted_us = avg;
> +			return;
> +		}
> +	}
> +
> +	/*
> +	 * If we have outliers to the upside in our distribution, discard
> +	 * those by setting the threshold to exclude these outliers, then
> +	 * calculate the average and standard deviation again. Once we get
> +	 * down to the bottom 3/4 of our samples, stop excluding samples.
> +	 *
> +	 * This can deal with workloads that have long pauses interspersed
> +	 * with sporadic activity with a bunch of short pauses.
> +	 */
> +	if ((divisor * 4) <= INTERVALS * 3)
> +		return;
> +
> +	thresh = max - 1;
> +	goto again;
> +}
> +
> +/**
> + * cpuidle_sched_select - selects the next idle state to enter
> + * @drv: cpuidle driver containing state data
> + * @dev: the CPU
> + */
> +int cpuidle_sched_select(struct cpuidle_driver *drv,
> +				struct cpuidle_device *dev)
> +{
> +	struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
> +	int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
> +	int i;
> +	unsigned int interactivity_req;
> +	struct timespec t;
> +
> +	if (data->needs_update) {
> +		cpuidle_sched_update(drv, dev);
> +		data->needs_update = 0;
> +	}
> +
> +	data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
> +
> +	/* Special case when user has set very strict latency requirement */
> +	if (unlikely(latency_req == 0))
> +		return 0;
> +
> +	/* determine the expected residency time, round up */
> +	t = ktime_to_timespec(tick_nohz_get_sleep_length());
> +	data->next_timer_us =
> +		t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
> +
> +	data->bucket = which_bucket(data->next_timer_us);
> +
> +	/*
> +	 * Force the result of multiplication to be 64 bits even if both
> +	 * operands are 32 bits.
> +	 * Make sure to round up for half microseconds.
> +	 */
> +	data->predicted_us = div_round64((uint64_t)data->next_timer_us *
> +					 data->correction_factor[data->bucket],
> +					 RESOLUTION * DECAY);
> +
> +	get_typical_interval(data);
> +
> +	/*
> +	 * Performance multiplier defines a minimum predicted idle
> +	 * duration / latency ratio. Adjust the latency limit if
> +	 * necessary.
> +	 */
> +	interactivity_req = data->predicted_us / performance_multiplier();
> +	if (latency_req > interactivity_req)
> +		latency_req = interactivity_req;
> +
> +	/*
> +	 * We want to default to C1 (hlt), not to busy polling
> +	 * unless the timer is happening really really soon.
> +	 */
> +	if (data->next_timer_us > 5 &&
> +	    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
> +		dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
> +		data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
> +
> +	/*
> +	 * Find the idle state with the lowest power while satisfying
> +	 * our constraints.
> +	 */
> +	for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
> +		struct cpuidle_state *s = &drv->states[i];
> +		struct cpuidle_state_usage *su = &dev->states_usage[i];
> +
> +		if (s->disabled || su->disable)
> +			continue;
> +		if (s->target_residency > data->predicted_us)
> +			continue;
> +		if (s->exit_latency > latency_req)
> +			continue;
> +
> +		data->last_state_idx = i;
> +	}
> +
> +	return data->last_state_idx;
> +}
> +
> +/**
> + * cpuidle_sched_reflect - records that data structures need update
> + * @dev: the CPU
> + * @index: the index of actual entered state
> + *
> + * NOTE: it's important to be fast here because this operation will add to
> + *       the overall exit latency.
> + */
> +void cpuidle_sched_reflect(struct cpuidle_device *dev, int index)
> +{
> +	struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
> +	data->last_state_idx = index;
> +	if (index >= 0)
> +		data->needs_update = 1;
> +}
> +
> +/**
> + * cpuidle_sched_update - attempts to guess what happened after entry
> + * @drv: cpuidle driver containing state data
> + * @dev: the CPU
> + */
> +static void cpuidle_sched_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> +{
> +	struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
> +	int last_idx = data->last_state_idx;
> +	struct cpuidle_state *target = &drv->states[last_idx];
> +	unsigned int measured_us;
> +	unsigned int new_factor;
> +
> +	/*
> +	 * Try to figure out how much time passed between entry to low
> +	 * power state and occurrence of the wakeup event.
> +	 *
> +	 * If the entered idle state didn't support residency measurements,
> +	 * we are basically lost in the dark how much time passed.
> +	 * As a compromise, assume we slept for the whole expected time.
> +	 *
> +	 * Any measured amount of time will include the exit latency.
> +	 * Since we are interested in when the wakeup began, not when it
> +	 * was completed, we must subtract the exit latency. However, if
> +	 * the measured amount of time is less than the exit latency,
> +	 * assume the state was never reached and the exit latency is 0.
> +	 */
> +	if (unlikely(!(target->flags & CPUIDLE_FLAG_TIME_VALID))) {
> +		/* Use timer value as is */
> +		measured_us = data->next_timer_us;
> +
> +	} else {
> +		/* Use measured value */
> +		measured_us = cpuidle_get_last_residency(dev);
> +
> +		/* Deduct exit latency */
> +		if (measured_us > target->exit_latency)
> +			measured_us -= target->exit_latency;
> +
> +		/* Make sure our coefficients do not exceed unity */
> +		if (measured_us > data->next_timer_us)
> +			measured_us = data->next_timer_us;
> +	}
> +
> +	/* Update our correction ratio */
> +	new_factor = data->correction_factor[data->bucket];
> +	new_factor -= new_factor / DECAY;
> +
> +	if (data->next_timer_us > 0 && measured_us < MAX_INTERESTING)
> +		new_factor += RESOLUTION * measured_us / data->next_timer_us;
> +	else
> +		/*
> +		 * we were idle so long that we count it as a perfect
> +		 * prediction
> +		 */
> +		new_factor += RESOLUTION;
> +
> +	/*
> +	 * We don't want 0 as factor; we always want at least
> +	 * a tiny bit of estimated time. Fortunately, due to rounding,
> +	 * new_factor will stay nonzero regardless of measured_us values
> +	 * and the compiler can eliminate this test as long as DECAY > 1.
> +	 */
> +	if (DECAY == 1 && unlikely(new_factor == 0))
> +		new_factor = 1;
> +
> +	data->correction_factor[data->bucket] = new_factor;
> +
> +	/* update the repeating-pattern data */
> +	data->intervals[data->interval_ptr++] = measured_us;
> +	if (data->interval_ptr >= INTERVALS)
> +		data->interval_ptr = 0;
> +}
> +
> +/**
> + * cpuidle_sched_enable_device - scans a CPU's states and does setup
> + * @drv: cpuidle driver
> + * @dev: the CPU
> + */
> +int cpuidle_sched_enable_device(struct cpuidle_driver *drv,
> +				struct cpuidle_device *dev)
> +{
> +	struct sched_cpuidle_info *data = &per_cpu(cpuidle_info, dev->cpu);
> +	int i;
> +
> +	memset(data, 0, sizeof(struct sched_cpuidle_info));
> +
> +	/*
> +	 * if the correction factor is 0 (eg first time init or cpu hotplug
> +	 * etc), we actually want to start out with a unity factor.
> +	 */
> +	for (i = 0; i < BUCKETS; i++)
> +		data->correction_factor[i] = RESOLUTION * DECAY;
> +
> +	return 0;
> +}
> +
> 
> 
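
As an aside for anyone following the arithmetic in the quoted "Energy
break even point" comment above: the prediction and the decaying update
of the correction factor can be played with in a few lines of userspace
C. This is only an illustration with invented numbers, not part of the
patch.

#include <stdio.h>
#include <stdint.h>

#define RESOLUTION	1024
#define DECAY		8

int main(void)
{
	unsigned int next_timer_us = 10000;		/* next timer event in 10 ms */
	unsigned int factor = RESOLUTION * DECAY / 2;	/* history: ~50% of the timer */
	unsigned int measured_us = 4000;		/* how long the CPU actually slept */
	uint64_t predicted_us;

	/* predicted = next_timer * factor / (RESOLUTION * DECAY), rounded */
	predicted_us = ((uint64_t)next_timer_us * factor +
			(RESOLUTION * DECAY) / 2) / (RESOLUTION * DECAY);
	printf("predicted idle: %llu us\n", (unsigned long long)predicted_us);

	/* decaying running average, as in cpuidle_sched_update() */
	factor -= factor / DECAY;
	factor += RESOLUTION * measured_us / next_timer_us;
	printf("new correction factor: %u (unity is %u)\n",
	       factor, RESOLUTION * DECAY);

	return 0;
}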

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH V2 02/19] sched/power: Move idle state selection into the scheduler
  2014-08-18 15:54   ` Nicolas Pitre
@ 2014-08-18 17:19     ` Preeti U Murthy
  2014-08-18 18:25       ` Nicolas Pitre
  0 siblings, 1 reply; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-18 17:19 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo, len.brown, yuyang.du,
	linaro-kernel, daniel.lezcano, corbet, catalin.marinas,
	markgross, sundar.iyer, linux-kernel, dietmar.eggemann,
	Lorenzo.Pieralisi, mike.turquette, akpm, paulmck, tglx

On 08/18/2014 09:24 PM, Nicolas Pitre wrote:
> On Mon, 11 Aug 2014, Preeti U Murthy wrote:
> 
>> The goal of the power aware scheduling design is to integrate all
>> policy, metrics and averaging into the scheduler. Today the
>> cpu power management is fragmented and hence inconsistent.
>>
>> As a first step towards this integration, rid the cpuidle state management
>> of the governors. Retain only the cpuidle driver in the cpu idle
>> susbsystem which acts as an interface between the scheduler and low
>> level platform specific cpuidle drivers. For all decision making around
>> selection of idle states,the cpuidle driver falls back to the scheduler.
>>
>> The current algorithm for idle state selection is the same as the logic used
>> by the menu governor. However going ahead the heuristics will be tuned and
>> improved upon with metrics better known to the scheduler.
> 
> I'd strongly suggest a different approach here.  Instead of copying the 
> menu governor code and tweaking it afterwards, it would be cleaner to 
> literally start from scratch with a new governor.  Said new governor 
> would grow inside the scheduler with more design freedom instead of 
> being strapped on the side.
> 
> By copying existing code, the chance for cruft to remain for a long time 
> is close to 100%. We already have one copy of it, let's keep it working 
> and start afresh instead.
> 
> By starting clean it is way easier to explain and justify additions to a 
> new design than convincing ourselves about the removal of no longer 
> needed pieces from a legacy design.

Ok. The reason I did it this way was that I did not find anything
grossly wrong in the current cpuidle governor algorithm. Of course this
can be improved but I did not see strong reasons to completely wipe it
away. I see good scope to improve upon the existing algorithm with
additional knowledge of *the idle states being mapped to scheduling
domains*. This will in itself give us a better algorithm and does not
mandate significant changes from the current algorithm. So I really
don't see why we need to start from scratch.
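
To make that a bit more concrete, the kind of per-domain information I
have in mind could look roughly like the sketch below. This is purely
illustrative -- the struct and field names are invented here; patch
03/19 is where the idle states actually get enumerated in the scheduler
topology.

/* Illustrative only: names are invented, nothing like this exists yet. */
struct sd_idle_info {
	unsigned int	max_exit_latency_us;	 /* worst wakeup cost at this domain level */
	unsigned int	max_target_residency_us; /* break-even time of the deepest state */
};

/*
 * With something like this attached to each sched_domain, the selection
 * logic could compare the predicted idle time of the whole domain, not
 * just of one CPU, against max_target_residency_us before committing to
 * a deep state.
 */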

The primary issue that I found was that, with the goal being a power
aware scheduler, we must ensure that it is no longer possible for a
governor to register with cpuidle to choose idle states. The reason is
that there is just *one entity who will take this decision and there is
no option about it*. This patch intends to bring the focus to this
specific detail.

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH V2 01/19] sched/power: Remove cpu idle state selection and cpu frequency tuning
  2014-08-18 15:39   ` Nicolas Pitre
@ 2014-08-18 17:26     ` Preeti U Murthy
  2014-08-18 17:53       ` Nicolas Pitre
  0 siblings, 1 reply; 23+ messages in thread
From: Preeti U Murthy @ 2014-08-18 17:26 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo, len.brown, yuyang.du,
	linaro-kernel, daniel.lezcano, corbet, catalin.marinas,
	markgross, sundar.iyer, linux-kernel, dietmar.eggemann,
	Lorenzo.Pieralisi, mike.turquette, akpm, paulmck, tglx

On 08/18/2014 09:09 PM, Nicolas Pitre wrote:
> On Mon, 11 Aug 2014, Preeti U Murthy wrote:
> 
>> As a first step towards improving the power awareness of the scheduler,
>> this patch enables a "dumb" state where all power management is turned off.
>> Whatever additionally we put into the kernel for cpu power management must
>> do better than this in terms of performance as well as powersavings.
>> This will enable us to benchmark and optimize the power aware scheduler
>> from scratch. If we are to benchmark it against the performance of the
>> existing design, we will get sufficiently distracted by the performance
>> numbers and get steered away from a sane design.
> 
> I understand your goal here, but people *will* compare performance 
> between the old and the new design anyway.  So I think it would be a 
> better approach to simply let the existing code be and create a new 
> scheduler-based governor that can be swapped with the existing ones at 
> run time.  Eventually we'll want average users to test and compare this, 
> and asking them to recompile a second kernel and reboot between them 
> might get unwieldy to many people.
> 
> And by allowing both to coexist at run time, we're making sure both the 
> old and the new code are built helping not breaking the old code.  And 
> that will also cut down on the number of #ifdefs in many places.
> 
> In other words, CONFIG_SCHED_POWER is needed to select the scheduler 
> based governor but it shouldn't force the existing code disabled.

I don't think I understand you here. So are you proposing a runtime
switch like a sysfs interface instead of a config switch? Wouldn't that
be unwise given that it's a complete turnaround of the kernel's behavior
after the switch? I agree that the first patch is a dummy patch; it's
meant to ensure that we have *at least* the power efficiency that this
patch brings in. Of course after that point this patch is a no-op. In
fact the subsequent patches will mitigate the effect of this.

Regards
Preeti U Murthy
> 
> 
>> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH V2 01/19] sched/power: Remove cpu idle state selection and cpu frequency tuning
  2014-08-18 17:26     ` Preeti U Murthy
@ 2014-08-18 17:53       ` Nicolas Pitre
  0 siblings, 0 replies; 23+ messages in thread
From: Nicolas Pitre @ 2014-08-18 17:53 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo, len.brown, yuyang.du,
	linaro-kernel, daniel.lezcano, corbet, catalin.marinas,
	markgross, sundar.iyer, linux-kernel, dietmar.eggemann,
	Lorenzo.Pieralisi, mike.turquette, akpm, paulmck, tglx

On Mon, 18 Aug 2014, Preeti U Murthy wrote:

> On 08/18/2014 09:09 PM, Nicolas Pitre wrote:
> > On Mon, 11 Aug 2014, Preeti U Murthy wrote:
> > 
> >> As a first step towards improving the power awareness of the scheduler,
> >> this patch enables a "dumb" state where all power management is turned off.
> >> Whatever additionally we put into the kernel for cpu power management must
> >> do better than this in terms of performance as well as powersavings.
> >> This will enable us to benchmark and optimize the power aware scheduler
> >> from scratch.If we are to benchmark it against the performance of the
> >> existing design, we will get sufficiently distracted by the performance
> >> numbers and get steered away from a sane design.
> > 
> > I understand your goal here, but people *will* compare performance 
> > between the old and the new design anyway.  So I think it would be a 
> > better approach to simply let the existing code be and create a new 
> > scheduler-based governor that can be swapped with the existing ones at 
> > run time.  Eventually we'll want average users to test and compare this, 
> > and asking them to recompile a second kernel and reboot between them 
> > might get unwieldy to many people.
> > 
> > And by allowing both to coexist at run time, we're making sure both the 
> > old and the new code are built helping not breaking the old code.  And 
> > that will also cut down on the number of #ifdefs in many places.
> > 
> > In other words, CONFIG_SCHED_POWER is needed to select the scheduler 
> > based governor but it shouldn't force the existing code disabled.
> 
> I don't think I understand you here. So are you proposing a runtime
> switch like a sysfs interface instead of a config switch?

Absolutely. 

And looking at drivers/cpuidle/sysfs.c:store_current_governor() it seems 
that the facility is there already.
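
For illustration, hooking the code from patch 02/19 into that facility
could look roughly like the sketch below. This is not part of the patch
set; it simply wraps the cpuidle_sched_* entry points in a standard
struct cpuidle_governor so the legacy governors and the scheduler-based
one can coexist and be swapped at run time (IIRC the current_governor
attribute becomes writable when booting with cpuidle_sysfs_switch).

#include <linux/cpuidle.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/sched.h>

static struct cpuidle_governor sched_governor = {
	.name		= "sched",
	.rating		= 20,	/* arbitrary; same rating as menu */
	.enable		= cpuidle_sched_enable_device,
	.select		= cpuidle_sched_select,
	.reflect	= cpuidle_sched_reflect,
	.owner		= THIS_MODULE,
};

static int __init sched_governor_init(void)
{
	return cpuidle_register_governor(&sched_governor);
}
postcore_initcall(sched_governor_init);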

> Wouldn't that be unwise given that it's a complete turnaround of the
> kernel's behavior after the switch?

Oh sure.  This is like changing cpufreq governors at run time.  But 
people should know what they're playing with and that system behavior 
changes are expected.

> I agree that the first patch is a dummy patch; it's
> meant to ensure that we have *at least* the power efficiency that this
> patch brings in. Of course after that point this patch is a no-op. In
> fact the subsequent patches will mitigate the effect of this.

Still, allowing runtime switches between  the legacy governors and 
the in-scheduler governor will greatly facilitate benchmarking.

And since our goal is to surpass the legacy governors, we should set
them as our reference mark from the start.


Nicolas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH V2 02/19] sched/power: Move idle state selection into the scheduler
  2014-08-18 17:19     ` Preeti U Murthy
@ 2014-08-18 18:25       ` Nicolas Pitre
  0 siblings, 0 replies; 23+ messages in thread
From: Nicolas Pitre @ 2014-08-18 18:25 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: alex.shi, vincent.guittot, peterz, pjt, efault, rjw,
	morten.rasmussen, svaidy, arjan, mingo, len.brown, yuyang.du,
	linaro-kernel, daniel.lezcano, corbet, catalin.marinas,
	markgross, sundar.iyer, linux-kernel, dietmar.eggemann,
	Lorenzo.Pieralisi, mike.turquette, akpm, paulmck, tglx

On Mon, 18 Aug 2014, Preeti U Murthy wrote:

> On 08/18/2014 09:24 PM, Nicolas Pitre wrote:
> > On Mon, 11 Aug 2014, Preeti U Murthy wrote:
> > 
> >> The goal of the power aware scheduling design is to integrate all
> >> policy, metrics and averaging into the scheduler. Today the
> >> cpu power management is fragmented and hence inconsistent.
> >>
> >> As a first step towards this integration, rid the cpuidle state management
> >> of the governors. Retain only the cpuidle driver in the cpu idle
> >> susbsystem which acts as an interface between the scheduler and low
> >> level platform specific cpuidle drivers. For all decision making around
> >> selection of idle states,the cpuidle driver falls back to the scheduler.
> >>
> >> The current algorithm for idle state selection is the same as the logic used
> >> by the menu governor. However going ahead the heuristics will be tuned and
> >> improved upon with metrics better known to the scheduler.
> > 
> > I'd strongly suggest a different approach here.  Instead of copying the 
> > menu governor code and tweaking it afterwards, it would be cleaner to 
> > literally start from scratch with a new governor.  Said new governor 
> > would grow inside the scheduler with more design freedom instead of 
> > being strapped on the side.
> > 
> > By copying existing code, the chance for cruft to remain for a long time 
> > is close to 100%. We already have one copy of it, let's keep it working 
> > and start afresh instead.
> > 
> > By starting clean it is way easier to explain and justify additions to a 
> > new design than convincing ourselves about the removal of no longer 
> > needed pieces from a legacy design.
> 
> Ok. The reason I did it this way was that I did not find anything
> grossly wrong in the current cpuidle governor algorithm. Of course this
> can be improved but I did not see strong reasons to completely wipe it
> away. I see good scope to improve upon the existing algorithm with
> additional knowledge of *the idle states being mapped to scheduling
> domains*. This will in itself give us a better algorithm and does not
> mandate significant changes from the current algorithm. So I really
> don't see why we need to start from scratch.

Sure the current algorithm can be improved.  But it has its limitations 
by design.  And simply making it more topology aware wouldn't justify 
moving it into the scheduler.

What we're contemplating is something completely integrated with the 
scheduler where cpuidle and cpufreq (and eventually thermal management) 
together are part of the same "governor" to provide global decisions on 
all fronts.

Not only should the next wake-up event be predicted, but also the 
anticipated system load, etc.  The scheduler may know that a given CPU 
is unlikely to be used for a while and could call for the deepest 
C-state right away without waiting for the current menu heuristic to 
converge.
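
Purely as an illustration of what I mean (the helper below is invented;
it stands for whatever "this CPU won't be needed for a while" hint the
scheduler could export):

/* Invented helper: sched_cpu_expected_idle_us() does not exist. */
static int sched_quick_select(struct cpuidle_driver *drv,
			      struct cpuidle_device *dev)
{
	int deepest = drv->state_count - 1;

	if (sched_cpu_expected_idle_us(dev->cpu) >=
	    drv->states[deepest].target_residency)
		return deepest;		/* deepest state, right away */

	return -1;			/* fall back to the usual estimation */
}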

There is also Daniel's I/O latency tracking that could replace the menu 
governor latency guessing, the latter based on heuristics that could be
described as black magic.

And all this has to eventually be policed by a global performance/power 
concern that should weigh C-states, P-states and task placement
together and select the best combination (Morten's work).

Therefore the current menu algorithm won't do it. It simply wasn't 
designed for that.

We'll have the opportunity to discuss this further tomorrow anyway.

> The primary issue that I found was that, with the goal being a power
> aware scheduler, we must ensure that it is no longer possible for a
> governor to register with cpuidle to choose idle states. The reason is
> that there is just *one entity who will take this decision and there is
> no option about it*. This patch intends to bring the focus to this
> specific detail.

I think there is nothing wrong with having multiple governors being 
registered.  We simply decide at runtime via sysfs which one has control 
over the low-level cpuidle drivers.


Nicolas

^ permalink raw reply	[flat|nested] 23+ messages in thread


Thread overview: 23+ messages
2014-08-11 11:31 [RFC PATCH V2 00/19] Power Scheduler Design Preeti U Murthy
2014-08-11 11:32 ` [RFC PATCH V2 01/19] sched/power: Remove cpu idle state selection and cpu frequency tuning Preeti U Murthy
2014-08-18 15:39   ` Nicolas Pitre
2014-08-18 17:26     ` Preeti U Murthy
2014-08-18 17:53       ` Nicolas Pitre
2014-08-11 11:33 ` [RFC PATCH V2 02/19] sched/power: Move idle state selection into the scheduler Preeti U Murthy
2014-08-18 15:54   ` Nicolas Pitre
2014-08-18 17:19     ` Preeti U Murthy
2014-08-18 18:25       ` Nicolas Pitre
2014-08-11 11:33 ` [RFC PATCH V2 03/19] sched/idle: Enumerate idle states in scheduler topology Preeti U Murthy
2014-08-11 11:34 ` [RFC PATCH V2 04/19] sched: add sched balance policies in kernel Preeti U Murthy
2014-08-11 11:34 ` [RFC PATCH V2 05/19] sched: add sysfs interface for sched_balance_policy selection Preeti U Murthy
2014-08-11 11:35 ` [RFC PATCH V2 06/19] sched: log the cpu utilization at rq Preeti U Murthy
2014-08-11 11:35 ` [RFC PATCH V2 07/19] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing Preeti U Murthy
2014-08-11 11:36 ` [RFC PATCH V2 08/19] sched: move sg/sd_lb_stats struct ahead Preeti U Murthy
2014-08-11 11:36 ` [RFC PATCH V2 09/19] sched: get rq potential maximum utilization Preeti U Murthy
2014-08-11 11:37 ` [RFC PATCH V2 10/19] sched: detect wakeup burst with rq->avg_idle Preeti U Murthy
2014-08-11 11:38 ` [RFC PATCH V2 11/19] sched: add power aware scheduling in fork/exec/wake Preeti U Murthy
2014-08-11 11:38 ` [RFC PATCH V2 12/19] sched: using avg_idle to detect bursty wakeup Preeti U Murthy
2014-08-11 11:39 ` [RFC PATCH V2 13/19] sched: packing transitory tasks in wakeup power balancing Preeti U Murthy
2014-08-11 11:39 ` [RFC PATCH V2 14/19] sched: add power/performance balance allow flag Preeti U Murthy
2014-08-11 11:40 ` [RFC PATCH V2 15/19] sched: pull all tasks from source grp and no balance for prefer_sibling Preeti U Murthy
2014-08-11 11:41 ` [RFC PATCH V2 16/19] sched: add new members of sd_lb_stats Preeti U Murthy
