From: Yuyang Du <yuyang.du@intel.com>
To: Steve Muckle <steve.muckle@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	Andrey Ryabinin <aryabinin@virtuozzo.com>,
	mingo@redhat.com, linux-kernel@vger.kernel.org,
	Paul Turner <pjt@google.com>, Ben Segall <bsegall@google.com>,
	ying.huang@intel.com
Subject: Re: [PATCH] sched/fair: fix mul overflow on 32-bit systems
Date: Mon, 4 Jan 2016 07:14:18 +0800
Message-ID: <20160103231418.GO28098@intel.com>
In-Reply-To: <20151218023309.GL28098@intel.com>

On Fri, Dec 18, 2015 at 10:33:09AM +0800, Yuyang Du wrote:

[...]

> +	long cap = (scale_load_down(SCHED_LOAD_SCALE) - cfs_rq->avg.util_avg) / 2;

This line of code has a bug: when cfs_rq->avg.util_avg transiently exceeds
scale_load_down(SCHED_LOAD_SCALE), the unsigned subtraction wraps around
instead of going negative, so the subsequent "cap > 0" check cannot catch
it. The patch at the end fixes this by casting the difference to long
before the division.
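
As a quick illustration of the wrap-around, here is a minimal user-space
sketch (not kernel code; the 1024 scale and the 1500 util_avg value are
made-up numbers for demonstration only):

	#include <stdio.h>

	int main(void)
	{
		unsigned long scale = 1024;	/* scale_load_down(SCHED_LOAD_SCALE) */
		unsigned long util_avg = 1500;	/* cfs_rq->avg.util_avg, transiently above the scale */

		/* buggy: the unsigned difference wraps, so "cap" comes out huge and positive */
		long buggy_cap = (scale - util_avg) / 2;
		/* fixed: cast to long first, so the result goes negative as intended */
		long fixed_cap = (long)(scale - util_avg) / 2;

		printf("buggy cap = %ld\n", buggy_cap);	/* 9223372036854775570 on 64-bit */
		printf("fixed cap = %ld\n", fixed_cap);	/* -238 */
		return 0;
	}

With the wrapped value, "cap > 0" still holds and the clamp in
post_init_entity_util_avg() effectively never triggers.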

I experimented with the patch and compared the new task's initial
util_avg and its parent cfs_rq's util_avg between v4.4-rc5 and
v4.4-rc5+patch.

The benchmark used is unixbench's spawn test, which keeps creating new
tasks that quickly exit:

./Run spawn -i 1

I collected data for a single copy and for as many copies as there are
CPUs (on a dual-core / four-thread SMP machine). For each run, the first
32000 samples from one task-spawning process are plotted in the attached graphs.

The objective of this patch is generally met: the new task's util_avg and
its cfs_rq's util_avg are bounded fairly tightly. The second method of
calculating the initial task util_avg is used very frequently (most of the
time), especially in the single-copy task-spawning test.

---
Subject: [PATCH] sched/fair: Initialize a new task's util_avg to a bounded
 value

A new task's util_avg is set to full utilization of a CPU (100% of the
time running). This accelerates the new task's utilization ramp-up,
which is useful for boosting its execution early on. However, it may
result in (insanely) high utilization for a transient period when a
flood of tasks is spawned. Importantly, it violates the "fundamentally
bounded" nature of CPU utilization, and the side effect is negative if
no measure is taken to bound it.

This patch proposes an algorithm to address this issue. It uses two
methods to arrive at a sensible initial util_avg:

(1) An expected (or average) util_avg based on its cfs_rq's util_avg:

util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight

(2) A cap on the trajectory along which successive new tasks' util develops,
which gives 1/2 of the remaining utilization budget to each new task, so
that the additional util is noticeably large (when overall util is low) or
negligibly small (when overall util is already high). At the same time,
the aggregate utilization is well bounded:

util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n

where n denotes the nth task.

If the extrapolated util_avg from (1) is larger than util_avg_cap, the
effective util_avg is clamped to util_avg_cap.
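
To make the intended behaviour concrete, the following is a small
user-space sketch of the scheme (assumptions, for illustration only:
every new task has NICE_0 weight of 1024, weight and load_avg share the
same 1024-based scale, the cfs_rq starts idle, and decay and task exit
are ignored). It reproduces the example series quoted in the code
comment further below:

	#include <stdio.h>

	#define SCALE 1024	/* scale_load_down(SCHED_LOAD_SCALE) */

	int main(void)
	{
		long cfs_util = 0, cfs_load = 0;
		long weight = 1024;	/* se.load.weight of a NICE_0 task (simplified) */

		for (int n = 1; n <= 7; n++) {
			/* method (2): half of the remaining utilization budget */
			long cap = (SCALE - cfs_util) / 2;
			long util;

			if (cap <= 0)
				break;
			if (cfs_util) {
				/* method (1): extrapolate from the cfs_rq's util/load ratio */
				util = cfs_util * weight / (cfs_load + 1);
				if (util > cap)
					util = cap;	/* clamp to the cap */
			} else {
				util = cap;
			}
			cfs_util += util;	/* the new task joins the cfs_rq */
			cfs_load += weight;
			printf("task %d: util_avg=%4ld, cfs_rq util_avg=%4ld\n",
			       n, util, cfs_util);
		}
		return 0;
	}

This prints the task series 512, 256, 128, 64, 32, 16, 8 and the cfs_rq
series 512, 768, 896, 960, 992, 1008, 1016: each spawn adds a
geometrically shrinking amount of utilization, so the cfs_rq's util_avg
converges toward 1024 instead of running away.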

Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/core.c  |    2 ++
 kernel/sched/fair.c  |   56 ++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |    1 +
 3 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a2ecefd..cd314de 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2485,6 +2485,8 @@ void wake_up_new_task(struct task_struct *p)
 	 */
 	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
+	/* Post-initialize new task's util average now that its cfs_rq is set */
+	post_init_entity_util_avg(&p->se);
 
 	rq = __task_rq_lock(p);
 	activate_task(rq, p, 0);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e3266eb..c31a14b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -682,17 +682,69 @@ void init_entity_runnable_average(struct sched_entity *se)
 	sa->period_contrib = 1023;
 	sa->load_avg = scale_load_down(se->load.weight);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
-	sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
-	sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+	/*
+	 * At this point, util_avg won't be used in select_task_rq_fair anyway
+	 */
+	sa->util_avg = 0;
+	sa->util_sum = 0;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
 
+/*
+ * With new tasks being created, their initial util_avgs are extrapolated
+ * based on the cfs_rq's current util_avg:
+ *
+ * util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
+ *
+ * However, in many cases, the above util_avg does not give a desired
+ * value. Moreover, the sum of the util_avgs may be divergent, such
+ * as when the series is a harmonic series.
+ *
+ * To solve this problem, we also cap the util_avg of successive tasks to
+ * only 1/2 of the remaining utilization budget:
+ *
+ * util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
+ *
+ * where n denotes the nth task.
+ *
+ * For example, the simplest series from the beginning would look like:
+ *
+ *  task  util_avg: 512, 256, 128,  64,  32,   16,    8, ...
+ * cfs_rq util_avg: 512, 768, 896, 960, 992, 1008, 1016, ...
+ *
+ * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap)
+ * if util_avg > util_avg_cap.
+ */
+void post_init_entity_util_avg(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct sched_avg *sa = &se->avg;
+	long cap = (long)(scale_load_down(SCHED_LOAD_SCALE) - cfs_rq->avg.util_avg) / 2;
+
+	if (cap > 0) {
+		if (cfs_rq->avg.util_avg != 0) {
+			sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
+			sa->util_avg /=	 (cfs_rq->avg.load_avg + 1);
+
+			if (sa->util_avg > cap)
+				sa->util_avg = cap;
+		}
+		else {
+			sa->util_avg = cap;
+		}
+		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+	}
+}
+
 static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq);
 #else
 void init_entity_runnable_average(struct sched_entity *se)
 {
 }
+void post_init_entity_util_avg(struct sched_entity *se)
+{
+}
 #endif
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f1637..7049ee9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1277,6 +1277,7 @@ extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 unsigned long to_ratio(u64 period, u64 runtime);
 
 extern void init_entity_runnable_average(struct sched_entity *se);
+extern void post_init_entity_util_avg(struct sched_entity *se);
 
 static inline void add_nr_running(struct rq *rq, unsigned count)
 {
-- 


[-- Attachment #2: 4.4-rc5.multiple.jpg --]
[-- Type: image/jpeg, Size: 54354 bytes --]

[-- Attachment #3: 4.4-rc5+patch.multiple.jpg --]
[-- Type: image/jpeg, Size: 53771 bytes --]

[-- Attachment #4: 4.4-rc5+patch.single.jpg --]
[-- Type: image/jpeg, Size: 113737 bytes --]

[-- Attachment #5: 4.4-rc5.single.jpg --]
[-- Type: image/jpeg, Size: 58950 bytes --]
