From: Vincent Guittot
Date: Tue, 26 Apr 2022 09:39:32 +0200
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity
To: Lukasz Luba
Cc: Dietmar Eggemann, Xuewen Yan, rafael@kernel.org, viresh.kumar@linaro.org, mingo@redhat.com, peterz@infradead.org, rostedt@goodmis.org, linux-kernel@vger.kernel.org, di.shen@unisoc.com, Xuewen Yan
References: <20220407051932.4071-1-xuewen.yan@unisoc.com> <02350916-aa36-ea53-2c98-91b97f49d27e@arm.com> <9bb00bcb-f984-1cf6-39aa-c11dcf7f07cb@arm.com> <24631a27-42d9-229f-d9b0-040ac993b749@arm.com>

On Thu, 21 Apr 2022 at 12:57, Lukasz Luba wrote:
>
> On 4/21/22 09:29, Vincent Guittot wrote:
> > On Tue, 19 Apr 2022 at 16:13, Lukasz Luba wrote:
> >>
> >> On 4/19/22 13:51, Vincent Guittot wrote:
> >>> On Tue, 19 Apr 2022 at 14:01, Lukasz Luba wrote:
> >>>>
> >>>> On 4/19/22 08:14, Vincent Guittot wrote:
> >>>>> On Sat, 16 Apr 2022 at 04:47, Xuewen Yan wrote:
> >>>>>>
> >>>>>> Hi Luba / Dietmar
> >>>>>>
> >>>>>> On Wed, Apr 13, 2022 at 9:26 PM Lukasz Luba wrote:
> >>>>>>>
> >>>>>>> [...]
> >>>>
> >>>> To be precise and maybe fix some potential design issues. We are
> >>>> talking here about the utilization and the max capacity set in the
> >>>> function sugov_get_util(), so the fields:
> >>>>
> >>>> sugov_cpu::util
> >>>> sugov_cpu::max /* max capacity */
> >>>
> >>> Yes. With this patch, util will be lower than the current thermal
> >>> mitigation, whereas util normally reflects what we need, not what can
> >>> be provided.
> >>
> >> These are different requirements: util has to be max capacity and
> >> max capacity has to be the original max CPU capacity - for schedutil.
> >> OK, why? What does this requirement add to the design and the final values?
> >
> > Because the calculation you are proposing is wrong and doesn't make
> > sense. Util is the average utilization of the cpu that has to be
> > compared with its original max capacity in order to get the freq that
> > matches this utilization.
> >
> > We have freq = util / max * max_freq and cpufreq will then cap freq
> > if mitigation is applied. Once the mitigation disappears, the request
> > will be back to the targeted freq.
> >
> > If you replace max by max' = max - arch_scale_thermal_pressure then:
> >
> > - by the time you do the calculation, arch_scale_thermal_pressure can
> > have changed and the result is meaningless. This is true whatever the
> > pace of updating arch_scale_thermal_pressure.
>
> The sudden change of the value taken from arch_scale_thermal_pressure
> I can understand, but there are similar cases and we live with them. Look at
> the whole EAS estimation done in a single CPU wake-up event, or the uclamp
> stuff. As long as this is not occurring too frequently - we live with it.
>
> I can see your concern here, since you mentioned below that you expect
> some platforms to hit it at a 'kHz' rate. This is probably not good, to
> trigger the kernel so often from HW/FW.
>
> That's why I have been struggling to find a 'good' design for this
> glue layer between the Arm FW and the kernel. Our FW probably won't cause
> such huge notification traffic. A rate of e.g. 50-100ms would be enough,
> especially if we have a per-CPU cpufreq policy. So we might have
> this 'PELT-like filter or signal' in FW, and just update the kernel
> less often.

In this case arch_scale_thermal_pressure() doesn't reflect the actual
thermal pressure but an average, which is what thermal_load_avg() is
also doing.

> But then there is an issue with the rising/decaying penalty of the
> kernel thermal pressure signal.
>
> We cannot assume that some SoCs don't do this already.
>
> Let's meet in the middle:
> 1) use the thermal PELT signal in RT:
>    capacity = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu))

This is what Dietmar and I have been suggesting.
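For reference, a minimal sketch of what 1) could look like in
rt_task_fits_capacity() (untested, based on the mainline code of that time;
only the cpu_cap line changes, and an actual patch may well differ):

static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
{
	unsigned long cpu_cap;
	unsigned long min_cap, max_cap;

	/* Only heterogeneous systems can benefit from this check */
	if (!static_branch_unlikely(&sched_asym_cpucapacity))
		return true;

	min_cap = uclamp_eff_value(p, UCLAMP_MIN);
	max_cap = uclamp_eff_value(p, UCLAMP_MAX);

	/*
	 * Subtract the PELT-averaged thermal pressure instead of the
	 * instantaneous arch_scale_thermal_pressure() value.
	 */
	cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));

	return cpu_cap >= min(min_cap, max_cap);
}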
> 2) introduce a more configurable thermal_pressure shifter instead of the
> current 'sched_thermal_decay_shift', which would allow making the decay
> not only longer but also shorter when the platform already does that
> itself, so as not to cause too much traffic.
>
> > - you change the range of capacity to max' = max -
> > arch_scale_thermal_pressure and you scale it to max_freq. If util > max',
> > then you will ask for max_freq regardless of whether util is really
> > close to max or not. Also you will ask for max freq even if util is
> > close to but below max', whereas the mitigation doesn't impact utilization.
>
> It's already there, even w/o the patch. That's why I gave you the examples.

Not sure how to understand the above. Utilization can already go above the
capacity, but this reflects the reality that the task could need more
capacity than the current cpu capacity.

> BTW, isn't it true that the utilization of the little CPU rq can reach
> 1024 today, after your change to PELT, when there is no idle time,
> even when the cpu max capacity is e.g. 300?

yes

> Before that change the utilization of a throttled CPU rq would converge
> to the current capacity of the CPU, am I right?

yes

> Is it this commit:
> 23127296889fe84b0762b191
>
> >>> [...]
> >>
> >>>> but then in both cases they are multiplied by 'max_freq' in (2)
> >>>>
> >>>> As you can see this is not the situation that you have described, is it?
> >>>> And the transient or non-transient is minor here IMO.
> >>>
> >>> If max is 512 then util = 640, which is much lower than 1024.
> >>
> >> What scenario is this?
> >> Is 1024 the utilization that we might have from the CPU rq?
> >> What is the original CPU capacity, 1024?
>
> Is this 1024 the utilization of the CPU runqueue, which since the new
> PELT can be bigger than the CPU capacity?
>
> >>>> Secondly, you have mentioned the mitigation in HW and the issue between
> >>>> instantaneous vs. PELT-averaged thermal pressure information. This is
> >>>> something that I have been scratching my head over for a long time. I'm
> >>>> trying to develop this for the new Arm FW thermal. You have mentioned:
> >>>> 'thermal mitigation is managed by HW at a much higher
> >>>> frequency than what Linux can handle' - I would also be more
> >>>> precise here: HW or FW? How often can the HW change the max freq, or
> >>>> how often can the FW change it? If we don't have those numbers,
> >>>> then the statement 'a much higher' doesn't help in solving this.
> >>>
> >>> 'Much higher' means that Linux can't react fast enough and should not
> >>> try to sync, because it's a lost game.
> >>
> >> As I said, 'much higher' is not a number to base a design on.
> >
> > But that gives you the constraint that you can't expect to be always
> > synced with an up-to-date value, which is the most important thing here.
> > This means that cpu_cap -= arch_scale_thermal_pressure(cpu) can be wrong
> > just after you computed it and your decision is wrong.
>
> This is a hypothetical situation where the value can change in such a
> noisy way on some platform. But I understand your concern.
>
> >> We need real numbers from real platforms. Currently we have two
> >> places where the thermal pressure is set:
> >> 1) cpufreq_cooling.c [1]
> >> 2) Qcom driver [2]
> >> (we might have a 3rd soon for Arm SCMI+FW)
> >
> > I don't have details but I have kHz in mind.
>
> If such interrupt traffic at a kHz rate is true for the driver in 2),
> then it's a bit concerning.
>
> Although smarter platforms shouldn't suffer because of a design forced
> by one corner-case platform.
>
> >> For the 2nd I would like to see numbers. For the 1st one, when
> >> kernel thermal is used (which supports a higher number of platforms
> >> compared to the Qcom driver), the kernel by design tries to control
> >> thermal, so changes are not that frequent.
> >>
> >> As of now, I can see in experiments that the 1st is suffering long decay
> >> delays and also corner cases with long-idle CPUs.
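For reference on that decay behaviour: the thermal pressure PELT signal is
currently aged with a right-shifted clock, roughly as below (simplified from
the kernel sources of that time; exact code may differ):

/* kernel/sched/core.c (simplified): boot-time setup of the decay shift */
unsigned int sched_thermal_decay_shift;

static int __init setup_sched_thermal_decay_shift(char *str)
{
	int _shift = 0;

	if (kstrtoint(str, 0, &_shift))
		pr_warn("Unable to set scheduler thermal pressure decay shift parameter\n");

	sched_thermal_decay_shift = clamp(_shift, 0, 10);
	return 1;
}
__setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);

/* kernel/sched/sched.h (simplified): clock used to age the thermal signal */
static inline u64 rq_clock_thermal(struct rq *rq)
{
	return rq_clock_task(rq) >> sched_thermal_decay_shift;
}

Shifting the clock right by N makes the thermal pressure signal rise and
decay 2^N times more slowly (e.g. booting with sched_thermal_decay_shift=3
slows it 8x), so today the knob can only make the reaction slower, which is
what the more configurable shifter in 2) above is meant to change.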
> >>
> >>> [...]
>
> >> I'm trying to help Xuewen to solve his/her issues with the RT class
> >> incrementally. I don't want to push him/her into the deep dark water
> >> of PELT signals, what variable to compare them with, and the corner cases
> >> when they are (or not) updated or completely not implemented. I'm not even
> >> sure if those extra complexities make sense for RT/DL (since they
> >> make some difference on big.mid.little specific platforms but not for
> >> the rest).
>
> As I said, we need a way forward; this issue of capacity inversion
> on big.mid.little is there. It has been there for ~2-3 years and is going to
> be even bigger in the future. So please don't block it, and prepare/share
> the numbers for the corner-case platforms.

I don't want to block anything, but I want a solution that is coherent
with the whole design and not just a fix for one use case.

> I have proposed where we can meet in the middle, consider it.
> I will prepare a patch for that shifter.
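(As a concrete illustration of the capacity inversion mentioned above, with
made-up numbers: on a big.mid.little system with capacity_orig of
1024/768/512, a sustained thermal pressure of 300 on the big CPU leaves it
with roughly 1024 - 300 = 724 of usable capacity, below the mid CPU's 768.
A fitness check that only looks at capacity_orig_of() still treats the
throttled big CPU as the most capable one, which is exactly the inversion
being discussed.)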