From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S936247AbdJRMpj (ORCPT ); Wed, 18 Oct 2017 08:45:39 -0400
Received: from usa-sjc-mx-foss1.foss.arm.com ([217.140.101.70]:39656 "EHLO
	foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S934902AbdJRMpd (ORCPT ); Wed, 18 Oct 2017 08:45:33 -0400
Date: Wed, 18 Oct 2017 13:45:25 +0100
From: Morten Rasmussen
To: Peter Zijlstra
Cc: mingo@kernel.org, linux-kernel@vger.kernel.org, tj@kernel.org,
	josef@toxicpanda.com, torvalds@linux-foundation.org,
	vincent.guittot@linaro.org, efault@gmx.de, pjt@google.com, clm@fb.com,
	dietmar.eggemann@arm.com, bsegall@google.com, yuyang.du@intel.com
Subject: Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
Message-ID: <20171018124523.GA27508@e105550-lin.cambridge.arm.com>
References: <20170901132059.342024223@infradead.org>
	<20170901132748.580255511@infradead.org>
	<20171009080856.GB24129@e105550-lin.cambridge.arm.com>
	<20171009094517.wfovc5b2dpbazta4@hirez.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20171009094517.wfovc5b2dpbazta4@hirez.programming.kicks-ass.net>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Oct 09, 2017 at 11:45:17AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 09, 2017 at 09:08:57AM +0100, Morten Rasmussen wrote:
> > > --- a/kernel/sched/debug.c
> > > +++ b/kernel/sched/debug.c
> > > @@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
> > >  			cfs_rq->removed.load_avg);
> > >  	SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
> > >  			cfs_rq->removed.util_avg);
> > > +	SEQ_printf(m, " .%-30s: %ld\n", "removed.runnable_sum",
> > > +			cfs_rq->removed.runnable_sum);
> > >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > >  	SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib",
> > >  			cfs_rq->tg_load_avg_contrib);
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_entit
> > >  	se->avg.last_update_time = n_last_update_time;
> > >  }
> > >
> > > -/* Take into account change of utilization of a child task group */
> > > +
> > > +/*
> > > + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> > > + * propagate its contribution. The key to this propagation is the invariant
> > > + * that for each group:
> > > + *
> > > + *   ge->avg == grq->avg						(1)
> > > + *
> > > + * _IFF_ we look at the pure running and runnable sums. Because they
> > > + * represent the very same entity, just at different points in the hierarchy.
> > > + *
> > > + *
> > > + * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
> > > + * simply copies the running sum over.
> > > + *
> > > + * However, update_tg_cfs_runnable() is more complex. So we have:
> > > + *
> > > + *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)
> > > + *
> > > + * And since, like util, the runnable part should be directly transferable,
> > > + * the following would _appear_ to be the straight forward approach:
> > > + *
> > > + *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)
> >
> > Should it be grq->avg.runnable_avg instead of running_avg?
>
> Yes very much so. Typing hard. Otherwise (3) would not follow from (2)
> either.
>
> > cfs_rq->avg.load_avg has been defined previous (in patch 2 I think) to
> > be:
> >
> > 	load_avg = \Sum se->avg.load_avg
> > 		 = \Sum se->load.weight * se->avg.runnable_avg
> >
> > That sum will increase when ge is runnable regardless of whether it is
> > running or not. So, I think it has to be runnable_avg to make sense?
>
> Ack.
>
> > > + *
> > > + * And per (1) we have:
> > > + *
> > > + *   ge->avg.running_avg == grq->avg.running_avg
> >
> > You just said further up that (1) only applies to running and runnable
> > sums? These are averages, so I think this is invalid use of (1). But
> > maybe that is part of your point about (4) being wrong?
> >
> > I'm still trying to get my head around the remaining bits, but it sort
> > of depends if I understood the above bits correctly :)
>
> So while true, the thing we're looking for is indeed runnable_avg.
>
> > > + *
> > > + * Which gives:
> > > + *
> > > + *                      ge->load.weight * grq->avg.load_avg
> > > + *   ge->avg.load_avg = -----------------------------------		(4)
> > > + *                               grq->load.weight
> > > + *
> > > + * Except that is wrong!
> > > + *
> > > + * Because while for entities historical weight is not important and we
> > > + * really only care about our future and therefore can consider a pure
> > > + * runnable sum, runqueues can NOT do this.
> > > + *
> > > + * We specifically want runqueues to have a load_avg that includes
> > > + * historical weights. Those represent the blocked load, the load we expect
> > > + * to (shortly) return to us. This only works by keeping the weights as
> > > + * integral part of the sum. We therefore cannot decompose as per (3).
> > > + *
> > > + * OK, so what then?
>
> And as the text above suggests, we cannot decompose because it contains
> the blocked weight, which is not included in grq->load.weight and thus
> things come apart.
>
> > > + * Another way to look at things is:
> > > + *
> > > + *   grq->avg.load_avg = \Sum se->avg.load_avg
> > > + *
> > > + * Therefore, per (2):
> > > + *
> > > + *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
> > > + *
> > > + * And the very thing we're propagating is a change in that sum (someone
> > > + * joined/left). So we can easily know the runnable change, which would be, per
> > > + * (2) the already tracked se->load_avg divided by the corresponding
> > > + * se->weight.
> > > + *
> > > + * Basically (4) but in differential form:
> > > + *
> > > + *   d(runnable_avg) += se->avg.load_avg / se->load.weight
> > > + *									(5)
> > > + *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
>
> And this all has runnable again, and so should make sense.

I'm afraid I don't quite get why (5) is correct. It might be related to
the issues Vincent already pointed out.

d(runnable_avg) is the runnable_avg series for the joining/leaving se
which is contributing to grq->avg.load_avg, but I don't see how you can
use that to compute the impact on ge->avg.load_avg.

	ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg	(2)

In (5) you have just substituted ge->avg.runnable_avg with
d(runnable_avg) in (2). However, the relationship between
ge->avg.runnable_avg and se->avg.runnable_avg is complicated. ge is
runnable whenever se is, but the reverse isn't necessarily true.

Let's say you have two always-runnable tasks on your grq and one of
them leaves (migrates away). In that case, ge->avg.runnable_avg is equal
to se->avg.runnable_avg (both always-runnable) which is d(runnable_avg),
so in (5) we end up with:

	ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg
			   - ge->load.weight * se->avg.runnable_avg
			 = 0

But you still have one always-running task on the grq so clearly it
shouldn't be zero.

IOW, AFAICT, it is not possible to decompose ge->avg.runnable_avg into
contributions from each individual se on the grq. At least not without
some additional assumptions.

What am I missing?

Morten
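
The two-task example above can be checked numerically. What follows is only a minimal sketch under stated assumptions: runnable_avg is modelled as a fixed-point value where 1024 means "always runnable", and the helper names (load_from_runnable, ge_load_after_leave_eq5, ge_load_expected) are hypothetical illustrations, not functions from kernel/sched/fair.c. It applies (5) to a departing se and compares the result with what (2) gives while one always-runnable task remains on the grq.

```c
#include <assert.h>

/* Assumption: fixed-point scale where 1024 means "always runnable". */
#define RUNNABLE_SCALE 1024L

/* Equation (2): load_avg = weight * runnable_avg, scaled back down. */
static long load_from_runnable(long weight, long runnable_avg)
{
	return weight * runnable_avg / RUNNABLE_SCALE;
}

/* Hypothetical helper: equation (5) applied to an se that leaves the grq. */
static long ge_load_after_leave_eq5(long ge_weight, long ge_load_avg,
				    long se_weight, long se_load_avg)
{
	long d_runnable;

	assert(se_weight > 0);
	/* d(runnable_avg) = se->avg.load_avg / se->load.weight */
	d_runnable = se_load_avg * RUNNABLE_SCALE / se_weight;
	/* ge->avg.load_avg -= ge->load.weight * d(runnable_avg) */
	return ge_load_avg - load_from_runnable(ge_weight, d_runnable);
}

/* Hypothetical helper: (2) with ge still always runnable (one task left). */
static long ge_load_expected(long ge_weight)
{
	return load_from_runnable(ge_weight, RUNNABLE_SCALE);
}
```

With ge_weight = 1024, ge_load_avg = 1024 and a departing task of weight 1024 and load_avg 1024, ge_load_after_leave_eq5() returns 0, whereas ge_load_expected(1024) returns 1024. That gap is the discrepancy described in the example.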