From: Douglas Raillard <firstname.lastname@example.org>
To: Vincent Guittot <email@example.com>
Cc: Peter Zijlstra <firstname.lastname@example.org>,
	linux-kernel <email@example.com>,
	"open list:THERMAL" <firstname.lastname@example.org>,
	Ingo Molnar <email@example.com>,
	"Rafael J. Wysocki" <firstname.lastname@example.org>,
	viresh kumar <email@example.com>,
	Juri Lelli <firstname.lastname@example.org>,
	Dietmar Eggemann <email@example.com>,
	Quentin Perret <firstname.lastname@example.org>,
	Patrick Bellasi <email@example.com>,
	firstname.lastname@example.org
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware
Date: Fri, 18 Oct 2019 17:03:10 +0100
Message-ID: <email@example.com>
In-Reply-To: <CAKfTPtATv+TaLus3ggijLWf0KAkexHgpHOTq++iqxaB4jeofirstname.lastname@example.org>

On 10/18/19 4:15 PM, Vincent Guittot wrote:
> On Fri, 18 Oct 2019 at 16:44, Douglas Raillard <email@example.com> wrote:
>>
>>
>>
>> On 10/18/19 1:07 PM, Peter Zijlstra wrote:
>>> On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote:
>>>
>>>>> What I don't see is how that that difference makes sense as input to:
>>>>>
>>>>>   cost(x) : (1 + x) * cost_j
>>>>
>>>> The actual input is:
>>>> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)
>>>>
>>>> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1
>>>> is not directly reflected in the code but is important for units
>>>> consistency.
>>>
>>> But completely irrelevant for the actual math and conceptual
>>> understanding.
>>
>>  > how that that difference makes sense as input to
>> I was unsure if you referred to the units being inconsistent or the
>> actual way of computing values being strange, so I provided some
>> justification for both.
>>
>>> Just because computers suck at real numbers, and floats
>>> are expensive, doesn't mean we have to burden ourselves with fixed point
>>> when writing equations.
>>>
>>> Also, as a physicist I'm prone to normalizing everything to 1, because
>>> that's lazy.
>>>
>>>>> I suppose that limits the additional OPP to twice the previously
>>>>> selected cost / efficiency (see the confusion from that other email).
>>>>> But given that efficency drops (or costs rise) for higher OPPs that
>>>>> still doesn't really make sense..
>>>
>>>> Yes, this current limit to +100% freq boosting is somehow arbitrary and
>>>> could probably benefit from being tunable in some way (Kconfig option
>>>> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost
>>>> than the one strictly required, which is expected. The goal is to speed
>>>> things up at the expense of more power consumed to achieve the same work,
>>>> hence at a lower efficiency (== higher cost).
>>>
>>> No, no Kconfig knobs.
>>>
>>>> That's the main reason why this boosting apply a margin on the cost of the
>>>> selected OPP rather than just inflating the util. This allows controlling
>>>> directly how much more power (battery life) we are going to spend to achieve
>>>> some work that we know could be achieved with less power.
>>>
>>> But you're not; the margin is relative to the OPP, it is not absolute.
>>
>> Considering a CPU with 1024 max capacity (since we are not talking about
>> migrations here, we can ignore CPU invariance):
>>
>> work = normalized number of iterations of a given busy loop
>> # Thanks to freq invariance
>> work = util (between 0 and 1)
>> util = f/f_max
>>
>> # f(work) is the min freq that is admissible for "work", which we will
>> # abbreviate as "f"
>> f(work) = work * f_max
>>
>> # from struct em_cap_state doc in energy_model.h
>> cost(f) = power(f) * f_max / f
>> cost(f) = power(f) / util
>> cost(f) = power(f) / work
>> power(f) = cost(f) * work
>>
>> boosted_cost(f) = cost(f) + x
>
> In em_pd_get_higher_freq, the boost is a % of cost(f) so it should be
> boosted_cost(f) = cost(f) + cost(f) * x

Good point, this means that we need to change "x" in these equations:
x = cost(f) * margin

Which leads to (with x' = x / cost(f_max) = cost'(work) * margin):
lost_battery_percent(work) =
	(100 * T / total_battery_energy) * cost'(work) * margin * work

lost_battery_percent(work) is still proportional to something that can
easily be traced and averaged (cost'(work,t) * margin(work,t)).

At the end of the day, since the impact depends on whether the workload
actually triggers the boosting condition, tracing is necessary to see how
it performs.

Other than that, I agree that things become simpler if
em_pd_get_higher_freq() takes an absolute margin (as a per-1024 of max
cost) rather than something proportional to cost(f). I'll make the change
for v4.

>> boosted_power(f) = boosted_cost(f) * work
>> boosted_power(f) = (cost(f) + x) * work
>>
>> # Let's normalize cost() so we can forget about f and deal only with work.
>> cost'(work) = cost(f)/cost(f_max)
>> x' = x/cost(f_max)
>> boosted_power'(work) = (cost'(work) + x') * work
>> boosted_power'(work) = cost'(work) * work + x' * work
>> boosted_power'(work) = power'(work) + x' * work
>> boosted_power'(work) = power'(work) + A(work)
>>
>> # Over a duration T, spend an extra B unit of energy
>> B(work) = A(work) * T
>> lost_battery_percent(work) = 100 * B(work)/total_battery_energy
>> lost_battery_percent(work) = 100 * T * x' * work /total_battery_energy
>> lost_battery_percent(work) =
>>	(100 * T / cost(f_max) / total_battery_energy) * x * work
>>
>> This means that the effect of boosting on battery life is proportional
>> to "x" unless I made a mistake somewhere.
>>
>>>
>>> Or rather, the only actual limit is in relation to the max OPP. So you
>>> have very little actual control over how much more energy you're
>>> spending.
>>>
>>>>> So while I agree that 2) is a reasonable signal to work from, everything
>>>>> that comes after is still much confusing me.
>>>
>>>> "When applying these boosting rules on the runqueue util signals ...":
>>>> Assuming the set of enqueued tasks stays the same between 2 observations
>>>> from schedutil, if we see the rq util_avg increase above its
>>>> util_est.enqueued, that means that at least one task had its util_avg go
>>>> above util_est.enqueued. We might miss some boosting opportunities if some
>>>> (util - util_est) compensates:
>>>> TASK_1(util - util_est) = - TASK_2(util - util_est)
>>>> but working on the aggregated value is much easier in schedutil, to avoid
>>>> crawling the list of entities.
>>>
>>> That still does not explain why 'util - util_est', when >0, makes for a
>>> sensible input into an OPP relative function
>>>
>>> I agree that 'util - util_est', when >0, indicates utilization is
>>> increasing (for the aperiodic blah blah blah). But after that I'm still
>>> confused.
>>
>> For the same reason PELT makes a sensible input for OPP selection.
>> Currently, OPP selection is based on max(util_avg, util_est.enqueued)
>> (from cpu_util_cfs in sched.h), so as soon as we have
>> (util - util_est > 0), the OPP will be selected according to util_avg.
>> In a way, using util_avg there is already some kind of boosting.
>>
>> Since the boosting is essentially (util - constant), it grows the same
>> way as util. If we think of (util - util_est) as being some estimation
>> of how wrong we were in the estimation of the task "true" utilization of
>> the CPU, then it makes sense to feed that to the boost. The wronger we
>> were, the more we want to boost, because the more time passes, the more
>> the scheduler realizes it actually does not know what the task needs. In
>> doubt, provide a higher freq than usual until we get to know this task
>> better. When that happens (at the next period), boosting is disabled and
>> we revert to the usual behavior (aka margin=0).
>>
>> Hope we are converging to some wording that makes sense.
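
For illustration only (this is not code from the patch series), the
derivation quoted above can be checked numerically with a small Python
sketch. It assumes everything is normalized as in the equations: work and
util are in [0, 1], cost'(work) and x' are fractions of cost(f_max), and
total_battery_energy is expressed in the same normalized units. The
function names and the sample numbers are hypothetical.

```python
def boost_input(util_avg, util_est):
    """Boost signal from the thread: (util - util_est) when positive, else 0.

    Simplified assumption: values are normalized to [0, 1] rather than
    scaled by 1024 as in the kernel.
    """
    return max(util_avg - util_est, 0.0)


def relative_to_absolute(cost_norm, margin):
    """x' = cost'(work) * margin: margin expressed as a fraction of cost(f)."""
    return cost_norm * margin


def lost_battery_percent(work, x_norm, duration, total_battery_energy):
    """100 * T * x' * work / total_battery_energy, as in the quoted derivation."""
    extra_power = x_norm * work            # A(work) = x' * work
    extra_energy = extra_power * duration  # B(work) = A(work) * T
    return 100.0 * extra_energy / total_battery_energy


# Hypothetical numbers: half-loaded CPU, OPP costing 80% of cost(f_max),
# 25% relative margin, over 10 time units, with a battery of 1000 units.
x_norm = relative_to_absolute(cost_norm=0.8, margin=0.25)
print(lost_battery_percent(work=0.5, x_norm=x_norm,
                           duration=10.0, total_battery_energy=1000.0))
```

With an absolute margin, x_norm is passed in directly and the result
scales linearly with it. With a relative margin, the same margin value
costs more energy at OPPs whose normalized cost is higher.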