From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 15 Dec 2015 10:22:29 +0800
From: Yuyang Du
To: Peter Zijlstra
Cc: Morten Rasmussen, Andrey Ryabinin, mingo@redhat.com,
	linux-kernel@vger.kernel.org, Paul Turner, Ben Segall
Subject: Re: [PATCH] sched/fair: fix mul overflow on 32-bit systems
Message-ID: <20151215022229.GF28098@intel.com>
References: <1449838518-26543-1-git-send-email-aryabinin@virtuozzo.com>
 <20151211132551.GO6356@twins.programming.kicks-ass.net>
 <20151211133612.GG6373@twins.programming.kicks-ass.net>
 <566AD6E1.2070005@virtuozzo.com>
 <20151211175751.GA27552@e105550-lin.cambridge.arm.com>
 <20151213224224.GC28098@intel.com>
 <20151214115453.GN6357@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20151214115453.GN6357@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Dec 14, 2015 at 12:54:53PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 14, 2015 at 06:42:24AM +0800, Yuyang Du wrote:
> > > In most cases 'r' shouldn't exceed 1024 and util_sum not significantly
> > > exceed 1024*47742, but in extreme cases like spawning lots of new tasks
> > > it may potentially overflow 32 bit. Newly created tasks contribute
> > > 1024*47742 each to the rq util_sum, which means that more than ~87 new
> > > tasks on a single rq will get us in trouble I think.
>
> > Both can work around the issue with additional overhead. But I suspect they
> > will end up going in the wrong direction for util_avg. The question is whether
> > a big util_sum (much bigger than 1024) is in a right range for it
> > to be used in load balancing.
>
> Right, it being >100% doesn't make any sense. We should look at ensuring
> it saturates at 100%, or at least have it be bounded much tighter to
> that, as currently it is entirely unbounded, which is quite horrible.
>
> > The problem is that it is not so good to initiate a new task's util_avg
> > to 1024. At least, it makes much less sense than a new task's load_avg
> > being initiated to its full weight. Because the top util_avg should be
> > well bounded by 1024 - the CPU's full utilization.
> >
> > So, maybe give the initial util_sum an average of its cfs_rq, like:
> >
> >	cfs_rq->avg.util_sum / cfs_rq->load.weight * task->load.weight
> >
> > And make sure that initial value is bounded on various conditions.
>
> That more or less results in a harmonic series, which is still very
> much unbounded.
>
> However, I think that makes sense, but would propose doing it
> differently. That condition is generally a maximum (assuming proper
> functioning of the weight based scheduling etc..) for any one task, so
> on migrate we can hard clip to this value.
>
> That still doesn't get rid of the harmonic series though, so we need
> more. Now we can obviously also hard clip the sum on add, which I
> suspect we'll need to do.
>
> That leaves us with a problem on remove though, at which point we can
> clip to this max if needed, but that will add a fair amount of cost to
> remove :/
>
> Alternatively, and I still have to go look through the code, we should
> clip when we've already calculated the weight based ratio anyway,
> avoiding the cost of that extra division.
>
> In any case, ideas, we'll have to play with I suppose.

What about this for a task? To cover the corner cases, it unexpectedly
becomes something like this :)

Some more thoughts: (1) what if we just init the util_avg to 0?
(2) There are still loopholes where some other util_avg may exceed 1024.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a2ecefd..d4e0b13 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2463,6 +2463,47 @@ static int dl_overflow(struct task_struct *p, int policy,
 
 extern void init_dl_bw(struct dl_bw *dl_b);
 
 /*
+ * With new tasks being created, their initial util_avgs are extrapolated
+ * based on the cfs_rq's current util_avg:
+ *
+ *   util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
+ *
+ * However, in many cases, the above util_avg does not give a desired
+ * value. Moreover, the sum of the util_avgs may be divergent, such
+ * as when the series is a harmonic series.
+ *
+ * To solve this problem, we also cap the util_avg of successive tasks to
+ * only 1/2 of the left utilization budget:
+ *
+ *   util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
+ *
+ * where n denotes the nth task.
+ *
+ * As such, a simplest series from the beginning would be like:
+ *
+ *   task  util_avg: 512, 256, 128,  64,  32,   16,    8, ...
+ * cfs_rq  util_avg: 512, 768, 896, 960, 992, 1008, 1016, ...
+ *
+ * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap)
+ * if util_avg > util_avg_cap.
+ */
+static void post_init_entity_util_avg(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = se->cfs_rq;
+	struct sched_avg *sa = &se->avg;
+	long delta = (scale_load_down(SCHED_LOAD_SCALE) - cfs_rq->avg.util_avg) / 2;
+
+	if (delta > 0) {
+		sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
+		sa->util_avg /= (cfs_rq->avg.load_avg + 1);
+
+		if (sa->util_avg > delta)
+			sa->util_avg = delta;
+		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+	}
+}
+
+/*
  * wake_up_new_task - wake up a newly created task for the first time.
  *
  * This function will do some initial scheduler statistics housekeeping
@@ -2484,6 +2525,7 @@ void wake_up_new_task(struct task_struct *p)
 	 *  - any previously selected cpu might disappear through hotplug
 	 */
 	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+	post_init_entity_util_avg(&p->se);
 #endif
 
 	rq = __task_rq_lock(p);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e3266eb..85c393e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -682,8 +682,11 @@ void init_entity_runnable_average(struct sched_entity *se)
 	sa->period_contrib = 1023;
 	sa->load_avg = scale_load_down(se->load.weight);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
-	sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
-	sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+	/*
+	 * At this point, util_avg won't be used in select_task_rq_fair anyway
+	 */
+	sa->util_avg = 0;
+	sa->util_sum = 0;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
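
For reference, the arithmetic behind the "~87 new tasks" estimate quoted
above: each new task contributes up to 1024 * 47742 = 48,887,808 to the
rq's util_sum, and (2^32 - 1) / 48,887,808 is about 87.8, so the 88th
such contribution can overflow a 32-bit sum. A quick user-space sketch
(not kernel code; it only assumes SCHED_LOAD_SCALE = 1024 and
LOAD_AVG_MAX = 47742) that checks the number:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		/* worst-case per-task contribution to util_sum: 1024 * 47742 */
		const uint64_t per_task = 1024ULL * 47742ULL;	/* 48,887,808 */
		const uint64_t u32_max = UINT32_MAX;		/* 4,294,967,295 */

		/* how many such contributions fit in 32 bits before overflow */
		printf("per-task contribution: %llu\n",
		       (unsigned long long)per_task);
		printf("tasks before 32-bit overflow: %llu\n",
		       (unsigned long long)(u32_max / per_task));	/* prints 87 */
		return 0;
	}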
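
And to illustrate what the cap in post_init_entity_util_avg() converges
to, a similar user-space sketch (again not kernel code; it hard-codes the
1024 budget, starts from an idle cfs_rq, and assumes every new task is
admitted right at its cap, i.e. the worst case for the series):

	#include <stdio.h>

	int main(void)
	{
		long cfs_rq_util_avg = 0;	/* assume an initially idle cfs_rq */
		int i;

		for (i = 1; i <= 8; i++) {
			/* util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2 per new task */
			long cap = (1024 - cfs_rq_util_avg) / 2;

			cfs_rq_util_avg += cap;	/* the new task starts at the cap */
			printf("task %d: util_avg = %4ld, cfs_rq util_avg = %4ld\n",
			       i, cap, cfs_rq_util_avg);
		}
		/*
		 * Prints 512/512, 256/768, 128/896, 64/960, ..., matching the
		 * series in the comment above post_init_entity_util_avg().
		 */
		return 0;
	}

So with the clamp to delta the cfs_rq's util_avg stays strictly below
1024 no matter how many tasks are forked, rather than growing without
bound.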