From: Yuyang Du <yuyang.du@intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>,
	Andrey Ryabinin <aryabinin@virtuozzo.com>,
	mingo@redhat.com, linux-kernel@vger.kernel.org,
	Paul Turner <pjt@google.com>, Ben Segall <bsegall@google.com>
Subject: Re: [PATCH] sched/fair: fix mul overflow on 32-bit systems
Date: Tue, 15 Dec 2015 10:22:29 +0800
Message-ID: <20151215022229.GF28098@intel.com>
In-Reply-To: <20151214115453.GN6357@twins.programming.kicks-ass.net>

On Mon, Dec 14, 2015 at 12:54:53PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 14, 2015 at 06:42:24AM +0800, Yuyang Du wrote:
> > > In most cases 'r' shouldn't exceed 1024 and util_sum shouldn't
> > > significantly exceed 1024*47742, but in extreme cases like spawning
> > > lots of new tasks it may potentially overflow 32 bits. Newly created
> > > tasks contribute 1024*47742 each to the rq util_sum, which means that
> > > more than ~87 new tasks on a single rq will get us in trouble I think.
> 
> > Both can work around the issue with additional overhead. But I suspect
> > they will end up going in the wrong direction for util_avg. The question
> > is whether a big util_sum (much bigger than 1024) is in the right range
> > for it to be used in load balancing.
> 
> Right, it being >100% doesn't make any sense. We should look at ensuring
> it saturates at 100%, or at least have it be bounded much tighter to
> that, as currently it's entirely unbounded, which is quite horrible.
> 
> > The problem is that it is not so good to initialize a new task's util_avg
> > to 1024. At least, it makes much less sense than a new task's load_avg
> > being initialized to its full weight, because the top util_avg should be
> > well bounded by 1024 - the CPU's full utilization.
> > 
> > So, maybe give the initial util_sum to an average of its cfs_rq, like:
> 
> >         cfs_rq->avg.util_sum / cfs_rq->load.weight * task->load.weight
> > 
> > And make sure that initial value is bounded under various conditions.
> 
> That more or less results in a harmonic series, which is still very
> much unbounded.
> 
> However, I think that makes sense, but would propose doing it
> differently. That condition is generally a maximum (assuming proper
> functioning of the weight-based scheduling etc.) for any one task, so
> on migrate we can hard clip to this value.
> 
> That still doesn't get rid of the harmonic series though, so we need
> more. Now we can obviously also hard clip the sum on add, which I
> suspect we'll need to do.
> 
> That leaves us with a problem on remove though, at which point we can
> clip to this max if needed, but that will add a fair amount of cost to
> remove :/
> 
> Alternatively, and I still have to go look through the code, we should
> clip when we've already calculated the weight-based ratio anyway,
> avoiding the cost of that extra division.
> 
> In any case, ideas we'll have to play with, I suppose.

What about this for a task? To cover the corner cases, it unexpectedly
becomes something like this :)

Some more thoughts: (1) what if we just init the util_avg to 0? (2) There
are still loopholes where some other util_avg may exceed 1024.
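
(For reference, the ~87-task bound quoted at the top is simple arithmetic,
assuming util_sum accumulates in a 32-bit unsigned integer:

	per-task contribution:  1024 * 47742          = 48,887,808
	32-bit limit:           2^32                  = 4,294,967,296
	tasks until overflow:   4294967296 / 48887808 ~= 87.8

As for the harmonic series: seeding the n-th new task with its weight-based
share multiplies the rq total by roughly (1 + 1/n), so the log of the total
grows like the harmonic sum of 1/n, which diverges.)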

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a2ecefd..d4e0b13 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2463,6 +2463,47 @@ static int dl_overflow(struct task_struct *p, int policy,
 extern void init_dl_bw(struct dl_bw *dl_b);
 
 /*
+ * With new tasks being created, their initial util_avgs are extrapolated
+ * based on the cfs_rq's current util_avg:
+ *
+ * util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
+ *
+ * However, in many cases, the above util_avg does not give a desired
+ * value. Moreover, the sum of the util_avgs may diverge, as in
+ * a harmonic series.
+ *
+ * To solve this problem, we also cap the util_avg of each successive
+ * task to only 1/2 of the remaining utilization budget:
+ *
+ * util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
+ *
+ * where n denotes the nth task.
+ *
+ * As such, the simplest series from the beginning would look like:
+ *
+ *  task  util_avg: 512, 256, 128,  64,  32,   16,    8, ...
+ * cfs_rq util_avg: 512, 768, 896, 960, 992, 1008, 1016, ...
+ *
+ * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap)
+ * if util_avg > util_avg_cap.
+ */
+static void post_init_entity_util_avg(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = se->cfs_rq;
+	struct sched_avg *sa = &se->avg;
+	long delta = (scale_load_down(SCHED_LOAD_SCALE) - cfs_rq->avg.util_avg) / 2;
+
+	if (delta > 0) {
+	sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
+	sa->util_avg /= (cfs_rq->avg.load_avg + 1);
+
+		if (sa->util_avg > delta)
+			sa->util_avg = delta;
+		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+	}
+}
+
+/*
  * wake_up_new_task - wake up a newly created task for the first time.
  *
  * This function will do some initial scheduler statistics housekeeping
@@ -2484,6 +2525,7 @@ void wake_up_new_task(struct task_struct *p)
 	 *  - any previously selected cpu might disappear through hotplug
 	 */
 	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+	post_init_entity_util_avg(&p->se);
 #endif
 
 	rq = __task_rq_lock(p);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e3266eb..85c393e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -682,8 +682,11 @@ void init_entity_runnable_average(struct sched_entity *se)
 	sa->period_contrib = 1023;
 	sa->load_avg = scale_load_down(se->load.weight);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
-	sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
-	sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+	/*
+	 * At this point, util_avg won't be used in select_task_rq_fair anyway
+	 */
+	sa->util_avg = 0;
+	sa->util_sum = 0;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
 

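To make the cap concrete, below is a minimal standalone sketch (plain
userspace C, not kernel code; SCHED_LOAD_SCALE = 1024 is the only constant
it borrows) that plays out the successive-task series from the comment
above:

/*
 * Standalone sketch: simulate the proposed util_avg cap for successive
 * new tasks on an initially idle cfs_rq. Each task is clipped to half
 * of the remaining utilization budget, so the rq total converges to
 * 1024 instead of growing unbounded.
 */
#include <stdio.h>

#define SCHED_LOAD_SCALE 1024L

int main(void)
{
	long rq_util_avg = 0;	/* stands in for cfs_rq->avg.util_avg */
	int n;

	for (n = 1; n <= 7; n++) {
		/* half of the remaining budget, as in the proposed cap */
		long task_util_avg = (SCHED_LOAD_SCALE - rq_util_avg) / 2;

		rq_util_avg += task_util_avg;
		printf("task %d: util_avg=%4ld  rq util_avg=%4ld\n",
		       n, task_util_avg, rq_util_avg);
	}
	return 0;
}

This prints 512, 256, 128, ... for the tasks and 512, 768, 896, ... for
the rq, matching the series in the patch comment; after n tasks the rq
total is 1024 * (1 - 2^-n), which approaches but never exceeds 1024.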