From: Vincent Guittot <vincent.guittot@linaro.org>
To: peterz@infradead.org, mingo@kernel.org,
	linux-kernel@vger.kernel.org, matt@codeblueprint.co.uk,
	Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com, kernellwp@gmail.com,
	yuyang.du@intel.comc, umgwanakikbuti@gmail.com,
	Vincent Guittot <vincent.guittot@linaro.org>
Subject: [PATCH 2/2 v3] sched: use load_avg for selecting idlest group
Date: Thu,  8 Dec 2016 17:56:54 +0100
Message-ID: <1481216215-24651-3-git-send-email-vincent.guittot@linaro.org>
In-Reply-To: <1481216215-24651-1-git-send-email-vincent.guittot@linaro.org>

find_idlest_group() only compares the runnable_load_avg when looking for
the least loaded group. But for fork-intensive use cases like hackbench,
where tasks block quickly after the fork, this can lead to selecting the
same CPU again instead of other CPUs, which have a similar runnable load
but a lower load_avg.

When the runnable_load_avg of two CPUs is close, we now take into account
the amount of blocked load as a 2nd selection factor. There are now 3 zones
for the candidate rq's runnable_load, relative to the current minimum
(min_runnable_load), as illustrated by the sketch below:
-[0 .. (min_runnable_load - imbalance)] : Select the new rq which has
significantly less runnable_load
-](min_runnable_load - imbalance) .. (min_runnable_load + imbalance)[ : The
runnable loads are close so we use load_avg to choose between the 2 rqs
-[(min_runnable_load + imbalance) .. ULONG_MAX] : Keep the current rq which
has significantly less runnable_load

The scale factor that is currently used for comparing runnable_load
doesn't work well with small values. As an example, the use of a scaling
factor fails as soon as this_runnable_load == 0: we then always select the
local rq even if min_runnable_load is only 1, although the two loads are
effectively the same. So instead of a scaling factor, we use an absolute
margin for runnable_load to detect CPUs with similar runnable_load, and we
keep using a scaling factor for the blocked load.
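
As a worked example (assuming imbalance_pct == 125 and
scale_load_down(NICE_0_LOAD) == 1024): the old relative check
"100*this_runnable_load < imbalance_scale*min_runnable_load" with
imbalance_scale = 100 + (125-100)/2 = 112 is true for
this_runnable_load == 0 and any min_runnable_load >= 1, so the local rq
always wins. With the absolute margin, imbalance = 1024 * (125-100) / 100
= 256, so this_runnable_load == 0 and min_runnable_load == 1 fall in the
"close" zone and load_avg decides instead.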

For use cases like hackbench, this enables the scheduler to select
different CPUs during the fork sequence and to spread tasks across the
system.

Tests have been done on a Hikey board (ARM based octo cores) for several
kernels. The results below give the min, max, avg and stdev values of 18
runs with each configuration.

The v4.8+patches configuration also includes the change below, which is
part of the proposal made by Peter to ensure that the rq clock is up to
date when the forked task is attached to the rq.

@@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
 	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	rq = __task_rq_lock(p, &rf);
+	update_rq_clock(rq);
 	post_init_entity_util_avg(&p->se);

 	activate_task(rq, p, 0);

hackbench -P -g 1

       ea86cb4b7621  7dc603c9028e  v4.8        v4.8+patches
min    0.049         0.050         0.051       0.048
avg    0.057         0.057(0%)     0.057(0%)   0.055(+5%)
max    0.066         0.068         0.070       0.063
stdev  +/-9%         +/-9%         +/-8%       +/-9%

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1da846b..0129fbb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5405,16 +5405,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	struct sched_group *most_spare_sg = NULL;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
+	unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
 	unsigned long most_spare = 0, this_spare = 0;
 	int load_idx = sd->forkexec_idx;
-	int imbalance = 100 + (sd->imbalance_pct-100)/2;
+	int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
+	unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
+				(sd->imbalance_pct-100) / 100;
 
 	if (sd_flag & SD_BALANCE_WAKE)
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load, spare_cap, max_spare_cap;
+		unsigned long load, avg_load, runnable_load;
+		unsigned long spare_cap, max_spare_cap;
 		int local_group;
 		int i;
 
@@ -5431,6 +5435,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		 * the group containing the CPU with most spare capacity.
 		 */
 		avg_load = 0;
+		runnable_load = 0;
 		max_spare_cap = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
@@ -5440,7 +5445,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 			else
 				load = target_load(i, load_idx);
 
-			avg_load += load;
+			runnable_load += load;
+
+			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
 
 			spare_cap = capacity_spare_wake(i, p);
 
@@ -5449,14 +5456,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		}
 
 		/* Adjust by relative CPU capacity of the group */
-		avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+		avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
+					group->sgc->capacity;
+		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
+					group->sgc->capacity;
 
 		if (local_group) {
-			this_load = avg_load;
+			this_runnable_load = runnable_load;
+			this_avg_load = avg_load;
 			this_spare = max_spare_cap;
 		} else {
-			if (avg_load < min_load) {
-				min_load = avg_load;
+			if (min_runnable_load > (runnable_load + imbalance)) {
+				/*
+				 * The runnable load is significantly smaller
+				 * so we can pick this new cpu
+				 */
+				min_runnable_load = runnable_load;
+				min_avg_load = avg_load;
+				idlest = group;
+			} else if ((runnable_load < (min_runnable_load + imbalance)) &&
+					(100*min_avg_load > imbalance_scale*avg_load)) {
+				/*
+				 * The runnable loads are close so we take
+				 * into account blocked load through avg_load
+				 * which is blocked + runnable load
+				 */
+				min_avg_load = avg_load;
 				idlest = group;
 			}
 
@@ -5480,13 +5505,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		goto skip_spare;
 
 	if (this_spare > task_util(p) / 2 &&
-	    imbalance*this_spare > 100*most_spare)
+	    imbalance_scale*this_spare > 100*most_spare)
 		return NULL;
 	else if (most_spare > task_util(p) / 2)
 		return most_spare_sg;
 
 skip_spare:
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest ||
+	    (min_runnable_load > (this_runnable_load + imbalance)) ||
+	    ((this_runnable_load < (min_runnable_load + imbalance)) &&
+			(100*this_avg_load < imbalance_scale*min_avg_load)))
 		return NULL;
 	return idlest;
 }
-- 
2.7.4

Thread overview: 9+ messages
2016-12-08 16:56 [PATCH 0/2 v3] sched: improve spread of tasks during fork Vincent Guittot
2016-12-08 16:56 ` [PATCH 1/2 v3] sched: fix find_idlest_group for fork Vincent Guittot
2016-12-09 13:18   ` Matt Fleming
2016-12-12  6:50   ` [tip:sched/core] sched/core: Fix find_idlest_group() " tip-bot for Vincent Guittot
2016-12-08 16:56 ` Vincent Guittot [this message]
2016-12-09 13:22   ` [PATCH 2/2 v3] sched: use load_avg for selecting idlest group Matt Fleming
2016-12-09 15:22   ` Peter Zijlstra
2016-12-09 16:28     ` Vincent Guittot
2016-12-12  6:51   ` [tip:sched/core] sched/core: Use " tip-bot for Vincent Guittot