linux-kernel.vger.kernel.org archive mirror
From: Alex Shi <alex.shi@intel.com>
To: torvalds@linux-foundation.org, mingo@redhat.com,
	peterz@infradead.org, tglx@linutronix.de,
	akpm@linux-foundation.org, arjan@linux.intel.com, bp@alien8.de,
	pjt@google.com, namhyung@kernel.org, efault@gmx.de
Cc: vincent.guittot@linaro.org, gregkh@linuxfoundation.org,
	preeti@linux.vnet.ibm.com, viresh.kumar@linaro.org,
	linux-kernel@vger.kernel.org, alex.shi@intel.com,
	morten.rasmussen@arm.com
Subject: [patch v5 14/15] sched: power aware load balance
Date: Mon, 18 Feb 2013 13:07:41 +0800	[thread overview]
Message-ID: <1361164062-20111-15-git-send-email-alex.shi@intel.com> (raw)
In-Reply-To: <1361164062-20111-1-git-send-email-alex.shi@intel.com>

This patch enables power aware consideration in load balancing.

As mentioned in the power aware scheduler proposal, power aware
scheduling rests on 2 assumptions:
1, race to idle is helpful for power saving
2, fewer active sched_groups reduce power consumption

The first assumption makes the performance policy take over scheduling
whenever any scheduler group is busy.
The second assumption makes power aware scheduling try to pack dispersed
tasks into fewer groups.

The enabling logic is summarized here:
1, Collect power aware scheduler statistics during performance load
balance statistics collection.
2, If the balancing cpu is eligible for power load balance, do it and
skip performance load balance. If the domain is suitable for power
balance but the cpu is inappropriate (idle or fully loaded), stop both
power and performance balance in this domain. If the performance policy
is in use, or any group is busy, do performance balance.

The above logic is mainly implemented in update_sd_lb_power_stats(). It
decides whether a domain is suitable for power aware scheduling. If so,
it fills the dst group and source group accordingly.

This patch reuses some of Suresh's power saving load balance code.

A test can show the effect of the different policies:
for ((i = 0; i < I; i++)) ; do while true; do :; done  &   done

On my SNB laptop with 4 cores * HT (the data is in Watts):
        powersaving     balance         performance
i = 2   40              54              54
i = 4   57              64*             68
i = 8   68              68              68

Note:
When i = 4 with the balance policy, the power may vary in the 57~68
Watt range, since the HT capacity and core capacity are both 1.

On an SNB EP machine with 2 sockets * 8 cores * HT:
        powersaving     balance         performance
i = 4   190             201             238
i = 8   205             241             268
i = 16  271             348             376

If the system has only a few sustained tasks, using a power policy can
yield both performance and power gains, as in a sysbench fileio randrw
test with 16 threads on the SNB EP box.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 126 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffdf35d..3b1e9a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3344,6 +3344,10 @@ struct sd_lb_stats {
 	unsigned int  sd_utils;	/* sum utilizations of this domain */
 	unsigned long sd_capacity;	/* capacity of this domain */
 	struct sched_group *group_leader; /* Group which relieves group_min */
+	struct sched_group *group_min;	/* Least loaded group in sd */
+	unsigned long min_load_per_task; /* load_per_task in group_min */
+	unsigned int  leader_util;	/* sum utilizations of group_leader */
+	unsigned int  min_util;		/* sum utilizations of group_min */
 };
 
 /*
@@ -4412,6 +4416,105 @@ static unsigned long task_h_load(struct task_struct *p)
 /********** Helpers for find_busiest_group ************************/
 
 /**
+ * init_sd_lb_power_stats - Initialize power savings statistics for
+ * the given sched_domain, during load balancing.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics for sd.
+ */
+static inline void init_sd_lb_power_stats(struct lb_env *env,
+						struct sd_lb_stats *sds)
+{
+	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE ||
+				env->idle == CPU_NOT_IDLE) {
+		env->power_lb = 0;
+		env->perf_lb = 1;
+		return;
+	}
+	env->perf_lb = 0;
+	env->power_lb = 1;
+	sds->min_util = UINT_MAX;
+	sds->leader_util = 0;
+}
+
+/**
+ * update_sd_lb_power_stats - Update the power saving stats for a
+ * sched_domain while performing load balancing.
+ *
+ * @env: The load balancing environment.
+ * @group: sched_group belonging to the sched_domain under consideration.
+ * @sds: Variable containing the statistics of the sched_domain
+ * @local_group: Does group contain the CPU for which we're performing
+ * load balancing?
+ * @sgs: Variable containing the statistics of the group.
+ */
+static inline void update_sd_lb_power_stats(struct lb_env *env,
+			struct sched_group *group, struct sd_lb_stats *sds,
+			int local_group, struct sg_lb_stats *sgs)
+{
+	unsigned long threshold, threshold_util;
+
+	if (env->perf_lb)
+		return;
+
+	if (sched_balance_policy == SCHED_POLICY_POWERSAVING)
+		threshold = sgs->group_weight;
+	else
+		threshold = sgs->group_capacity;
+	threshold_util = threshold * FULL_UTIL;
+
+	/*
+	 * If the local group is idle or full loaded
+	 * no need to do power savings balance at this domain
+	 */
+	if (local_group && (!sgs->sum_nr_running ||
+		sgs->group_utils + FULL_UTIL > threshold_util))
+		env->power_lb = 0;
+
+	/* Do performance load balance if any group overload */
+	if (sgs->group_utils > threshold_util) {
+		env->perf_lb = 1;
+		env->power_lb = 0;
+	}
+
+	/*
+	 * If a group is idle,
+	 * don't include that group in power savings calculations
+	 */
+	if (!env->power_lb || !sgs->sum_nr_running)
+		return;
+
+	/*
+	 * Calculate the group which has the least non-idle load.
+	 * This is the group from where we need to pick up the load
+	 * for saving power
+	 */
+	if ((sgs->group_utils < sds->min_util) ||
+	    (sgs->group_utils == sds->min_util &&
+	     group_first_cpu(group) > group_first_cpu(sds->group_min))) {
+		sds->group_min = group;
+		sds->min_util = sgs->group_utils;
+		sds->min_load_per_task = sgs->sum_weighted_load /
+						sgs->sum_nr_running;
+	}
+
+	/*
+	 * Calculate the group which is almost near its
+	 * capacity but still has some space to pick up some load
+	 * from other group and save more power
+	 */
+	if (sgs->group_utils + FULL_UTIL > threshold_util)
+		return;
+
+	if (sgs->group_utils > sds->leader_util ||
+	    (sgs->group_utils == sds->leader_util && sds->group_leader &&
+	     group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
+		sds->group_leader = group;
+		sds->leader_util = sgs->group_utils;
+	}
+}
+
+/**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
  * @idle: The Idle status of the CPU for whose sd load_icx is obtained.
@@ -4650,6 +4753,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
 		sgs->sum_weighted_load += weighted_cpuload(i);
+
+		/* accumulate the maximum potential util */
+		if (!nr_running)
+			nr_running = 1;
+		sgs->group_utils += rq->util * nr_running;
+
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
 	}
@@ -4758,6 +4867,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
+	init_sd_lb_power_stats(env, sds);
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
 	do {
@@ -4809,6 +4919,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			sds->group_imb = sgs.group_imb;
 		}
 
+		update_sd_lb_power_stats(env, sg, sds, local_group, &sgs);
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }
@@ -5026,6 +5137,19 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 */
 	update_sd_lb_stats(env, balance, &sds);
 
+	if (!env->perf_lb && !env->power_lb)
+		return  NULL;
+
+	if (env->power_lb) {
+		if (sds.this == sds.group_leader &&
+				sds.group_leader != sds.group_min) {
+			env->imbalance = sds.min_load_per_task;
+			return sds.group_min;
+		}
+		env->power_lb = 0;
+		return NULL;
+	}
+
 	/*
 	 * this_cpu is not the appropriate cpu to perform load balancing at
 	 * this level.
@@ -5203,8 +5327,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
-		.power_lb	= 0,
-		.perf_lb	= 1,
+		.power_lb	= 1,
+		.perf_lb	= 0,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -6282,7 +6406,6 @@ void unregister_fair_sched_group(struct task_group *tg, int cpu) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-
 static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
 {
 	struct sched_entity *se = &task->se;
-- 
1.7.12

