Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

From: Vincent Guittot <vincent.guittot@linaro.org>
To: Lukasz Luba <lukasz.luba@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Xuewen Yan <xuewen.yan@unisoc.com>,
	rafael@kernel.org, viresh.kumar@linaro.org, mingo@redhat.com,
	peterz@infradead.org, rostedt@goodmis.org,
	linux-kernel@vger.kernel.org, di.shen@unisoc.com,
	Xuewen Yan <xuewen.yan94@gmail.com>
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity
Date: Tue, 26 Apr 2022 09:39:32 +0200	[thread overview]
Message-ID: <CAKfTPtCu4zYchv1d4g-ztw=qR639BS2ncapQxfcwZqkSSQPY0w@mail.gmail.com> (raw)
In-Reply-To: <ae98a861-8945-e630-8d4c-8112723d1007@arm.com>

On Thu, 21 Apr 2022 at 12:57, Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 4/21/22 09:29, Vincent Guittot wrote:
> > On Tue, 19 Apr 2022 at 16:13, Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>
> >>
> >>
> >> On 4/19/22 13:51, Vincent Guittot wrote:
> >>> On Tue, 19 Apr 2022 at 14:01, Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 4/19/22 08:14, Vincent Guittot wrote:
> >>>>> On Sat, 16 Apr 2022 at 04:47, Xuewen Yan <xuewen.yan94@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Luba  / Dietmar
> >>>>>>
> >>>>>> On Wed, Apr 13, 2022 at 9:26 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>

[...]

> >>>> To be precised and maybe fix some potential design issues. We are
> >>>> talking here about utilization and set max capacity in function:
> >>>> sugov_get_util()
> >>>> so fields:
> >>>>
> >>>> sugov_cpu::util
> >>>> sugov_cpu::max /* max capacity */
> >>>
> >>> Yes. With this patch ,util will be lower than current thermal
> >>> mitigation whereas util normally reflects what we need  not what can
> >>> be provided
> >>
> >> This is a different requirements: util has to be max capacity and
> >> max capacity has to be original max CPU capacity - for the SchedUtil.
> >> OK, why? What this requirement adds in the design and final values?
> >
> > Because the calculation you are proposing is wrong and doesn't make
> > sense. Util is the average utilization of the cpu that has to be
> > compared with its original capacity max in order to get the freq that
> > matches with this utilization.
> >
> > We have freq = util / max * max_freq and cpufreq will then capp freq
> > if mitigation is applied. Once the mitigation disappear, the request
> > will be back to targeted freq.
> >
> > If you replace max by max' = max - arch_scale_thermal_pressure then :
> >
> > - by the time you do the calculation, arch_scale_thermal_pressure can
> > have changed and the result is meaningless. This is true whatever the
> > pace of updating arch_scale_thermal_pressure
>
> The sudden change of the value taken from arch_scale_thermal_pressure
> I can understand, but there are similar and we live with it. Look at
> the whole EAS estimations done in a one CPU waku-up event or the uclamp
> stuff. As far this is not too frequently occurring - we live wit it.
>
> I can see your concern here, since you mentioned below that you expect
> some platforms to hit it in 'khz' rate. This is probably not good, to
> trigger the kernel so often from HW/FW.
>
> That's why I have been struggling to find a 'good' design on this
> glue layer for Arm FW+kernel. Our FW would probably won't cause such
> huge notification traffic. A rate e.g. 50-100ms would be enough,
> especially if we have the per-CPU cpufreq policy. So we might have
> this 'PELT-like filter or signal' in FW, and just update kernel

In this case arch_scale_thermal_pressure() doesn't reflect the actual
thermal pressure but an average which is what thermal_load_avg() is
also doing.

> less often. But then there is an issue with the rising/decaying
> penalty of the kernel thermal pressure signal.
>
> We cannot assume that some SoCs don't do this already.
>
> Let's meet in the middle:
> 1) use the thermal PELT signal in RT:
> capacity = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu))

This is what Dietmar and I have been suggested

> 2) introduce a more configurable thermal_pressure shifter instead
> 'sched_thermal_decay_shift', which would allow not only to make the
> decaying longer, but also shorter when the platform already might do
> that, to not cause too much traffic.
>
> >
> > - you change the range of capacity to max'= max -
> > arch_scale_thermal_pressure and you scale it to max_freq. if util >
> > max', then you will ask for max_freq whatever the util being really
> > close to max or not. Also you will ask for max freq even if util is
> > close but below max' whereas the mitigation doesn't impact utilization
>
> It's already there, even w/o patch. That's why I gave you the examples.

Not sure how to understand this above.

utilization can already goes above but this reflects a reality that
the task could need more capacity than the current cpu capacity

>
> BTW, isn't true that the utilization of the Little CPU rq can reach
> 1024 today after your change to the PELT when there is no idle time,
> even when cpu max capacity is e.g. 300?

yes

> Before that change the utilization of a throttled CPU rq would converge
> to the current capacity of the CPU, am I right?

yes

>
> Is it this commit:
> 23127296889fe84b0762b191
>
> >
> >>
> >>>
> >>>>

[...]

> >>>
> >>>> but then in both cases are multiplied by 'max_freq' in (2)
> >>>>
> >>>> As you can see this is not the situation that you have described, is it?
> >>>> And the transient or non-transient is minor here IMO.
> >>>
> >>> If max is 512 then util = 640 which is much lower than 1024.
> >>
> >> What scenario is this?
> >> Is 1024 the utilization that we might have from the CPU rq?
> >> What is the original CPU capacity, 1024?
>
> Is this 1024 the utilization of the CPU runqueue because since
> the new PELT we can have it bigger than CPU capacity?
>
> >>
> >>>
> >>>>
> >>>> Secondly, you have mentioned the mitigation in HW and issue between
> >>>> instantaneous vs. PELT-one thermal pressure information. This is
> >>>> something that I'm stretching my head for long. I'm trying to
> >>>> develop this for new Arm FW thermal. You have mentioned:
> >>>> 'thermal mitigation is managed by HW at a much higher
> >>>> frequency than what Linux can handle' - I would be also more
> >>>> precised here: HW or FW? How often the HW can change max freq or
> >>>> how often FW can change that? If we don't have those numbers
> >>>> than statement: 'a much higher' doesn't help in solving this
> >>>
> >>> By much higher means that Linux can't react fast enough and should not
> >>> try to sync because it's a lost game
> >>
> >> As I said, 'much higher' is not a number to base a design on it.
> >
> > But that gives you the constraint that you can't expect to be always
> > synced with up to date value which is the most important here. This
> > means that  cpu_cap -= arch_scale_thermal_pressure(cpu) can be wrong
> > just after you computed it and your decision is wrong.
>
> This is hypothetical situation when the value can change in such
> noisy way on some platform. But I understand your concern.
>
> >
> >
> >> We need real numbers from real platforms. Currently we have two
> >> places where the thermal pressure is set:
> >> 1) cpufreq_cooling.c [1]
> >> 2) Qcom driver [2]
> >> (we might have 3rd soon for Arm SCMI+FW)
> >
> > I don't have details but i have khz in mind
>
> If such traffic of interrupts in khz is true for driver in 2)
> then it's a bit concerning.
>
> Although, smarter platforms shouldn't suffer due to design forced to one
> corner case platform.
>
> >
> >>
> >> For the 2nd I would like to see numbers. For the 1st one when
> >> kernel thermal is used (which supports higher number of platforms
> >> comparing to Qcom driver) as it's by design kernel tries to control
> >> thermal, so changes are not that frequent.
> >>
> >> As for now, I can see in experiments the 1st is suffering long decay
> >> delays and also corner cases with long idle CPUs.
> >>
> >>>

[...]

> >> I'm trying to help Xuewen to solve his/her issues with the RT class
> >> incrementally. I don't want to push him/her into a deep dark water
> >> of PELT signals, to what variable compare them, corner cases when they
> >> are (or not) updated or completely not implemented. I'm not even sure
> >> if those extra complexities make sense for the RT/DL (since they
> >> make some difference on big.mid.little specific platforms but not for
> >> the rest).
>
> As I said we need a way forward, this issue of capacity inversion
> on big.mid.little is there. It was for ~2-3years and is going to be
> even bigger in future. So please don't block it and prepare/share the
> numbers for the corner case platforms.

I don't want to block anything but just want a solution that is
coherent with the whole design and not just a fix for one UC

>
> I have proposed the where we can meet in the middle, consider it.
> I will prepare a patch for that shifter.