Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware
To: Vincent Guittot
Cc: Peter Zijlstra, linux-kernel, "open list:THERMAL", Ingo Molnar,
 "Rafael J. Wysocki", viresh kumar, Juri Lelli, Dietmar Eggemann,
 Quentin Perret, Patrick Bellasi, dh.han@samsung.com
References: <20191011134500.235736-1-douglas.raillard@arm.com>
 <20191014145315.GZ2311@hirez.programming.kicks-ass.net>
 <20191017095015.GI2311@hirez.programming.kicks-ass.net>
 <7edb1b73-54e7-5729-db5d-6b3b1b616064@arm.com>
 <20191017190708.GF22902@worktop.programming.kicks-ass.net>
 <0b807cb3-6a88-1138-dc66-9a32d9bba7ea@arm.com>
 <20191018120719.GH2328@hirez.programming.kicks-ass.net>
 <32d07c51-847d-9d51-480c-c8836f1aedc7@arm.com>
From: Douglas Raillard
Organization: ARM
Message-ID: <02e55a7f-8122-3745-a5c0-d46cd8450f17@arm.com>
Date: Fri, 18 Oct 2019 17:03:10 +0100

On 10/18/19 4:15 PM, Vincent Guittot wrote:
> On Fri, 18 Oct 2019 at 16:44, Douglas Raillard wrote:
>>
>> On 10/18/19 1:07 PM, Peter Zijlstra wrote:
>>> On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote:
>>>
>>>>> What I don't see is how that that difference makes sense as input to:
>>>>>
>>>>>   cost(x) : (1 + x) * cost_j
>>>>
>>>> The actual input is:
>>>>   x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)
>>>>
>>>> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1
>>>> is not directly reflected in the code but is important for units
>>>> consistency.
>>>
>>> But completely irrelevant for the actual math and conceptual
>>> understanding.
>>
>> > how that that difference makes sense as input to
>> I was unsure whether you were referring to the units being inconsistent or
>> to the actual way of computing the values being strange, so I provided
>> some justification for both.
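To make the unit-conversion side concrete, it boils down to something like
this (standalone illustration only, with the scale constants redefined
locally so it compiles on its own; not the actual patch code):

/* Illustrative only: mirrors the unit conversion discussed above. */
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL	/* util values are per-1024 */
#define EM_COST_MARGIN_SCALE	1024UL	/* cost margin is also per-1024 */

/*
 * Convert a positive (util - util_est) delta, expressed on the capacity
 * scale, into a margin on the EM cost scale. Since both scales are 1024,
 * the conversion factor is 1 and the code reduces to the delta itself,
 * but the units do change.
 */
static unsigned long util_delta_to_cost_margin(unsigned long util,
					       unsigned long util_est)
{
	if (util <= util_est)
		return 0;

	return (util - util_est) * EM_COST_MARGIN_SCALE / SCHED_CAPACITY_SCALE;
}

int main(void)
{
	/* util ramped 100/1024 above its estimated value */
	printf("margin = %lu\n", util_delta_to_cost_margin(612, 512));
	return 0;
}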
>>
>>> Just because computers suck at real numbers, and floats
>>> are expensive, doesn't mean we have to burden ourselves with fixed point
>>> when writing equations.
>>>
>>> Also, as a physicist I'm prone to normalizing everything to 1, because
>>> that's lazy.
>>>
>>>>> I suppose that limits the additional OPP to twice the previously
>>>>> selected cost / efficiency (see the confusion from that other email).
>>>>> But given that efficiency drops (or costs rise) for higher OPPs that
>>>>> still doesn't really make sense..
>>>
>>>> Yes, this current limit to +100% freq boosting is somewhat arbitrary and
>>>> could probably benefit from being tunable in some way (Kconfig option
>>>> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost
>>>> than the one strictly required, which is expected. The goal is to speed
>>>> things up at the expense of more power consumed to achieve the same work,
>>>> hence at a lower efficiency (== higher cost).
>>>
>>> No, no Kconfig knobs.
>>>
>>>> That's the main reason why this boosting applies a margin on the cost of the
>>>> selected OPP rather than just inflating the util. This allows controlling
>>>> directly how much more power (battery life) we are going to spend to achieve
>>>> some work that we know could be achieved with less power.
>>>
>>> But you're not; the margin is relative to the OPP, it is not absolute.
>>
>> Considering a CPU with 1024 max capacity (since we are not talking about
>> migrations here, we can ignore CPU invariance):
>>
>> work = normalized number of iterations of a given busy loop
>> # Thanks to freq invariance
>> work = util (between 0 and 1)
>> util = f/f_max
>>
>> # f(work) is the min freq that is admissible for "work", which we will
>> # abbreviate as "f"
>> f(work) = work * f_max
>>
>> # from struct em_cap_state doc in energy_model.h
>> cost(f) = power(f) * f_max / f
>> cost(f) = power(f) / util
>> cost(f) = power(f) / work
>> power(f) = cost(f) * work
>>
>> boosted_cost(f) = cost(f) + x
>
> In em_pd_get_higher_freq, the boost is a % of cost(f) so it should be
> boosted_cost(f) = cost(f) + cost(f)*x

Good point, this means that we need to change "x" in these equations:

  x = cost(f) * margin

Which leads to:

  lost_battery_percent(work) =
      (100 * T / cost(f_max) / total_battery_energy) * cost'(work) * margin * work

lost_battery_percent(work) is still proportional to something that can easily
be traced and averaged (cost'(work,t) * margin(work,t)). At the end of the
day, since the impact depends on whether the workload actually triggers the
boosting condition, tracing is necessary to see how it performs.

Other than that, I agree that the thing becomes simpler if
em_pd_get_higher_freq() takes an absolute margin (as a per-1024 fraction of
the max cost) rather than something proportional to cost(f). I'll make the
change for v4 (see the sketch at the end of this mail).

>> boosted_power(f) = boosted_cost(f) * work
>> boosted_power(f) = (cost(f) + x) * work
>>
>> # Let's normalize cost() so we can forget about f and deal only with work.
>> cost'(work) = cost(f)/cost(f_max)
>> x' = x/cost(f_max)
>> boosted_power'(work) = (cost'(work) + x') * work
>> boosted_power'(work) = cost'(work) * work + x' * work
>> boosted_power'(work) = power'(work) + x' * work
>> boosted_power'(work) = power'(work) + A(work)
>>
>> # Over a duration T, spend an extra B units of energy
>> B(work) = A(work) * T
>> lost_battery_percent(work) = 100 * B(work)/total_battery_energy
>> lost_battery_percent(work) = 100 * T * x' * work / total_battery_energy
>> lost_battery_percent(work) =
>>     (100 * T / cost(f_max) / total_battery_energy) * x * work
>>
>> This means that the effect of boosting on battery life is proportional
>> to "x", unless I made a mistake somewhere.
>>
>>>
>>> Or rather, the only actual limit is in relation to the max OPP. So you
>>> have very little actual control over how much more energy you're
>>> spending.
>>>
>>>>> So while I agree that 2) is a reasonable signal to work from, everything
>>>>> that comes after is still very much confusing me.
>>>
>>>> "When applying these boosting rules on the runqueue util signals ...":
>>>> Assuming the set of enqueued tasks stays the same between two observations
>>>> from schedutil, if we see the rq util_avg increase above its
>>>> util_est.enqueued, that means that at least one task had its util_avg go
>>>> above util_est.enqueued. We might miss some boosting opportunities if the
>>>> per-task (util - util_est) terms compensate for each other:
>>>>   TASK_1(util - util_est) = - TASK_2(util - util_est)
>>>> but working on the aggregated value is much easier in schedutil, to avoid
>>>> crawling the list of entities.
>>>
>>> That still does not explain why 'util - util_est', when >0, makes for a
>>> sensible input into an OPP relative function.
>>> I agree that 'util - util_est', when >0, indicates utilization is
>>> increasing (for the aperiodic blah blah blah). But after that I'm still
>>> confused.
>>
>> For the same reason PELT makes a sensible input for OPP selection.
>> Currently, OPP selection is based on max(util_avg, util_est.enqueued)
>> (from cpu_util_cfs() in sched.h), so as soon as we have
>> (util - util_est > 0), the OPP will be selected according to util_avg.
>> In a way, using util_avg there is already some kind of boosting.
>>
>> Since the boosting is essentially (util - constant), it grows the same
>> way as util. If we think of (util - util_est) as an estimate of how wrong
>> we were about the task's "true" utilization of the CPU, then it makes
>> sense to feed that to the boost. The more wrong we were, the more we want
>> to boost, because the more time passes, the more the scheduler realizes
>> it actually does not know what the task needs. When in doubt, provide a
>> higher freq than usual until we get to know this task better. When that
>> happens (at the next period), boosting is disabled and we revert to the
>> usual behavior (aka margin=0).
>>
>> Hope we are converging to some wording that makes sense.
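As a reference for the v4 change mentioned above, here is a rough standalone
sketch of what an absolute-margin lookup could look like. Types and the cost
table are simplified so it compiles on its own; this is not the actual kernel
code, which walks the em_perf_domain table with its own conventions:

/*
 * Sketch of an "absolute margin" em_pd_get_higher_freq()-like helper:
 * the margin is a per-1024 fraction of the *max* cost, so the extra
 * energy allowed no longer depends on the OPP we started from.
 */
#include <stdio.h>

#define EM_COST_MARGIN_SCALE	1024UL

struct cap_state {
	unsigned long frequency;	/* kHz */
	unsigned long cost;		/* power * f_max / f */
};

/* Toy cost table, lowest OPP first; costs only need to be monotonic. */
static const struct cap_state table[] = {
	{  500000,  400 },
	{ 1000000,  520 },
	{ 1500000,  700 },
	{ 2000000, 1024 },
};
#define NR_STATES (sizeof(table) / sizeof(table[0]))

/*
 * Return the highest frequency >= min_freq whose cost does not exceed
 * cost(min_freq) + margin * cost(f_max) / 1024.
 */
static unsigned long pd_get_higher_freq(unsigned long min_freq,
					unsigned long margin)
{
	unsigned long max_cost = table[NR_STATES - 1].cost;
	unsigned long allowed_cost;
	size_t i, base = 0;

	/* Find the OPP we would have picked without boosting. */
	for (i = 0; i < NR_STATES; i++) {
		base = i;
		if (table[i].frequency >= min_freq)
			break;
	}

	allowed_cost = table[base].cost +
		       margin * max_cost / EM_COST_MARGIN_SCALE;

	/* Walk up as long as the extra cost stays within the margin. */
	for (i = base; i + 1 < NR_STATES; i++) {
		if (table[i + 1].cost > allowed_cost)
			break;
	}

	return table[i].frequency;
}

int main(void)
{
	/* e.g. margin = util - util_est = 100 (per-1024) */
	printf("boosted freq = %lu kHz\n", pd_get_higher_freq(900000, 100));
	return 0;
}

The point of the absolute margin shows up in allowed_cost: the extra cost we
are willing to pay is margin * cost(f_max) / 1024 regardless of the starting
OPP, which is what keeps the battery-life impact proportional to the margin
alone, as derived above.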