From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH 2/2] thermal: cpufreq_cooling: Reuse effective_cpu_util()
To: Peter Zijlstra
Cc: Viresh Kumar, Ingo Molnar, Vincent Guittot, Zhang Rui, Daniel Lezcano,
    Amit Daniel Kachhap, Javi Merino, Amit Kucheria,
    linux-kernel@vger.kernel.org, Quentin Perret, Rafael Wysocki,
    linux-pm@vger.kernel.org
References: <20200716115605.GR10769@hirez.programming.kicks-ass.net>
    <681fb3e8-d645-2558-38de-b39b372499de@arm.com>
    <20200716154335.GT10769@hirez.programming.kicks-ass.net>
From: Lukasz Luba
Date: Fri, 17 Jul 2020 10:55:36 +0100
In-Reply-To: <20200716154335.GT10769@hirez.programming.kicks-ass.net>
X-Mailing-List: linux-pm@vger.kernel.org

On 7/16/20 4:43 PM, Peter Zijlstra wrote:
> On Thu, Jul 16, 2020 at 03:24:37PM +0100, Lukasz Luba wrote:
>> On 7/16/20 12:56 PM, Peter Zijlstra wrote:
>
>>> The second attempts to guesstimate power, and is the subject of this
>>> patch.
>>>
>>> Currently cpufreq_cooling appears to estimate the CPU energy usage by
>>> calculating the percentage of idle time using the per-cpu cpustat stuff,
>>> which is pretty horrific.
>>
>> Even worse, it then *samples* the *current* CPU frequency at that
>> particular point in time and assumes that when the CPU wasn't idle
>> during that period - it had *this* frequency...
>
> *whee* :-)
>
> ...
>
>> In EM we keep power values in an array and these values grow
>> exponentially. Each OPP has its corresponding
>>
>> P_x = C * (V_x)^2 * f_x, where x is the OPP id, thus the corresponding V, f
>>
>> so we have discrete power values, growing like:
>>
>>   ^(power)
>>   |
>>   |
>>   |                                   *
>>   |
>>   |
>>   |                        *
>>   |                        |
>>   |               *        |
>>   |                        |  <----- power estimation function
>>   |        *               |         should not use linear 'util/max_util'
>>   |  *                     |         relation here  *
>>   |________________________|_____________> (freq)
>>     opp0   opp1    opp2   opp3      opp4
>>
>> What is the problem?
>> First:
>> We need to pick the right power from the array. I would suggest
>> picking the max allowed frequency for that whole period, because
>> we don't know if the CPUs were using it (it's likely).
>> Second:
>> Then we have the utilization, which can be considered as:
>> 'idle period & running period with various freqs inside'; let's
>> call it the avg performance in that whole period.
>> Third:
>> Try to estimate the power used in that whole period, having
>> the avg performance and max performance.
>>
>> What you are suggesting is to travel that [*] line in a
>> non-linear fashion, i.e. in (util^3)/(max_util^3). Which means
>> it goes down faster when the utilization drops.
>> I think it is too aggressive, e.g.
>> 500^3 / 1024^3 = 0.116 <--- very little, ~12%
>> 200^3 / 300^3 = 0.296
>>
>> Peter, could you confirm if I understood you correctly?
>
> Correct, with the caveat that we might try and regression fit a 3rd
> order polynomial to a bunch of EM data to see if there's a 'better'
> function to be had than a raw 'f(x) := x^3'.

I agree, I think we are on the same wavelength now.
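
To make the difference concrete, here is a rough, self-contained sketch
of the two mappings we are discussing (the table values and helper names
are made up for illustration only; this is not the kernel EM API nor the
cpufreq_cooling code):

#include <stdio.h>

/* Made-up EM-style table: one {frequency, power} entry per OPP. */
struct perf_state {
	unsigned long freq_khz;
	unsigned long power_mw;	/* P_x = C * (V_x)^2 * f_x */
};

static const struct perf_state em_table[] = {
	{  500000,  70 },	/* opp0 */
	{  800000, 150 },	/* opp1 */
	{ 1100000, 300 },	/* opp2 */
	{ 1400000, 560 },	/* opp3 */
	{ 1700000, 900 },	/* opp4 */
};

/* Power at the max allowed OPP, scaled linearly by util/max_util. */
static unsigned long est_power_linear(unsigned long max_power,
				      unsigned long util,
				      unsigned long max_util)
{
	return max_power * util / max_util;
}

/* The same, but scaled by (util/max_util)^3. */
static unsigned long est_power_cubic(unsigned long max_power,
				     unsigned long util,
				     unsigned long max_util)
{
	unsigned long long num = (unsigned long long)max_power * util * util * util;
	unsigned long long den = (unsigned long long)max_util * max_util * max_util;

	return num / den;
}

int main(void)
{
	/* Assume opp4 is the max allowed OPP for the whole period. */
	unsigned long max_power = em_table[4].power_mw;

	/* util = 500 out of 1024 */
	printf("linear: %lu mW, cubic: %lu mW\n",
	       est_power_linear(max_power, 500, 1024),
	       est_power_cubic(max_power, 500, 1024));

	return 0;
}

With util = 500/1024, the linear mapping reports ~49% of the opp4 power
while the cubic one reports only ~12%, which is exactly the gap I find
too aggressive.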
>
>> This is quite an important bit for me.
>
> So, if we assume schedutil + EM, we can actually have schedutil
> calculate a running power sum. That is, something like: \Int P_x dt.
> Because we know the points where OPP changes.

Yes, that's why I was thinking about having this information stored
as a copy inside the EM, then just read it in other subsystems like
thermal, powercap, etc.

>
> Although, thinking more, I suspect we need tighter integration with
> cpuidle, because we don't actually have idle times here, but that should
> be doable.

I have been scratching my head for a while because of that idle issue.
It opens more dimensions to tackle.

>
> But for anything other than schedutil + EM, things become more
> interesting, because then we need to guesstimate power usage without the
> benefit of having actual power numbers.

Yes, from the engineering/research perspective, platforms which do not
have an EM in Linux (like Intel) are also interesting.

>
> We can of course still do that running power sum, with whatever P(u) or
> P(f) we end up with, I suppose.
>
>>> Another point is that cpu_util() vs turbo is a bit iffy, and to that,
>>> things like x86-APERF/MPERF and ARM-AMU got mentioned. Those might also
>>> have the benefit of giving you values that match your own sampling
>>> interval (100ms), where the sched stuff is PELT (64,32.. based).
>>>
>>> So what I've been thinking is that cpufreq drivers ought to be able to
>>> supply this method, and only when they lack it, can the cpufreq-governor
>>> (schedutil) install a fallback. And then cpufreq-cooling can use
>>> whatever is provided (through the cpufreq interfaces).
>>>
>>> That way, we:
>>>
>>> 1) don't have to export anything
>>> 2) get arch drivers to provide something 'better'
>>>
>>> Does that sound like something sensible?
>>
>> Yes, makes sense. Please also keep in mind that this
>> utilization somehow must be mapped into power in a proper way.
>> I am currently working on addressing all of these problems
>> (including this correlation).
>
> Right, so that mapping of util to power was what I was missing and
> suggesting we do. So for 'simple' hardware we have cpufreq events for
> frequency change, and cpuidle events for idle, and with EM we can simply
> sum the relevant power numbers.
>
> For hardware lacking EM, or hardware-managed DVFS, we'll have to fudge
> things a little. How best to do that is up in the air a little, but
> virtual power curves seem a useful tool to me.
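
For the 'simple' hardware case, the kind of running sum I picture looks
roughly like the sketch below (hook names and structure are made up for
illustration; this is not an existing kernel interface, and locking and
per-CPU details are ignored):

/*
 * Running power sum (\int P dt): on every OPP change or idle
 * transition, close the previous interval at its power level and
 * accumulate its energy.
 */
struct power_sum {
	unsigned long long energy_uj;	/* accumulated energy so far */
	unsigned long cur_power_mw;	/* power of the open interval */
	unsigned long long last_ts_us;	/* start of the open interval */
};

static void power_sum_update(struct power_sum *ps,
			     unsigned long long now_us,
			     unsigned long new_power_mw)
{
	unsigned long long delta_us = now_us - ps->last_ts_us;

	/* mW * us = nJ; divide by 1000 to accumulate in uJ */
	ps->energy_uj += ps->cur_power_mw * delta_us / 1000;
	ps->cur_power_mw = new_power_mw;
	ps->last_ts_us = now_us;
}

/* Hypothetical hook on a frequency change: the new OPP's EM power
 * becomes the power of the next interval. */
void on_freq_change(struct power_sum *ps, unsigned long long now_us,
		    unsigned long new_opp_power_mw)
{
	power_sum_update(ps, now_us, new_opp_power_mw);
}

/* Hypothetical cpuidle hooks: treat idle as ~zero dynamic power, which
 * is where the tighter cpuidle integration mentioned above comes in. */
void on_idle_enter(struct power_sum *ps, unsigned long long now_us)
{
	power_sum_update(ps, now_us, 0);
}

void on_idle_exit(struct power_sum *ps, unsigned long long now_us,
		  unsigned long cur_opp_power_mw)
{
	power_sum_update(ps, now_us, cur_opp_power_mw);
}

Then thermal (IPA), powercap, etc. could just read the accumulated
energy value instead of re-deriving it from idle statistics.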
>
> The next problem for IPA is having all the devices report power in the
> same virtual unit I suppose, but I'll leave that to others ;-)
>

True, there are more issues. There is also another movement around
powercap driven by Daniel Lezcano, which I am going to support.
Maybe he would be interested as well in having a copy of the calculated
energy stored in the EM. I must gather some requirements and align with
him.

Thank you for your support!

Regards,
Lukasz