linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Qais Yousef <qyousef@layalina.io>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@kernel.org>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	linux-kernel@vger.kernel.org, Xuewen Yan <xuewen.yan94@gmail.com>,
	Lukasz Luba <lukasz.luba@arm.com>, Wei Wang <wvw@google.com>,
	Jonathan JMChen <Jonathan.JMChen@mediatek.com>,
	Hank <han.lin@mediatek.com>
Subject: Re: [PATCH v2 8/9] sched/fair: Detect capacity inversion
Date: Sun, 20 Nov 2022 21:30:13 +0000	[thread overview]
Message-ID: <20221120213013.t67xisvqxmftri52@airbuntu> (raw)
In-Reply-To: <5ec773d4-0a19-cc04-e3bc-4cb57cf55b09@arm.com>

On 11/16/22 18:45, Dietmar Eggemann wrote:
> On 12/11/2022 20:35, Qais Yousef wrote:
> > On 11/09/22 11:42, Dietmar Eggemann wrote:
> > 
> 
> [...]
> 
> >>> +			 * thermal pressure. We record it as capacity
> >>> +			 * inversion.
> >>> +			 */
> >>> +			if (capacity_orig == pd_cap_orig) {
> >>> +				pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
> >>> +
> >>> +				if (pd_cap > inv_cap) {
> >>> +					rq->cpu_capacity_inverted = inv_cap;
> >>> +					break;
> >>> +				}
> >>
> >> In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
> >> pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
> >> erroneously since thermal_load_avg(rq) can return different values for
> >> inv_cap and pd_cap.
> > 
> > Good catch!
> > 
> >>
> >> So even on a classical big little system, this condition can set
> >> rq->cpu_capacity_inverted for a CPU in the little or big cluster.
> >>
> >> thermal_load_avg(rq) would have to stay constant for all CPUs within the
> >> PD to avoid this.
> >>
> >> This is one example of the `thermal pressure` is per PD (or Frequency
> >> Domain) in Thermal but per-CPU in the task scheduler.
> > 
> > Only compile tested so far, does this patch address all your points? I should
> > get hardware soon to run some tests and send the patch. I might re-write it to
> > avoid using pds; though it seems cleaner this way but we miss CAS support.
> > 
> > Thoughts?
> 
> I still don't think that the `CPU capacity inversion` implementation
> which uses `cap_orig' = cap_orig - thermal load avg (2)` instead of
> `cap_orig'' = cap_orig - thermal pressure (1)` for inverted CPUs (i.e.
> other PD exists w/ cap_orig > cap_orig') is the right answer, besides
> the EAS vs. CAS coverage.
> 
> The basic question for me is why do we have to switch between (1) and
> (2)? IIRC we introduced (1) in feec() to cater for the CPUfreq policy
> min/max capping between schedutil and the CPUfreq driver
> __resolve_freq() [drivers/cpufreq/cpufreq.c] (3).
> 
> The policy caps are set together with thermal pressure in
> cpufreq_set_cur_state() [drivers/thermal/cpufreq_cooling.c] via
> freq_qos_update_request().
> 
> I think we should only use (2) in the task scheduler even though the
> EAS-schedutil machinery would be not 100% in sync in some cases due to (3).
> Thermal load avg has similar properties like all the other EWMA-based
> signals we use and we have to live with a certain degree of inaccuracy
> anyway (e.g. also because of lock-less CPU statistic fetching when
> selecting CPU).
> 
> And in this case we wouldn't have to have infrastructure to switch
> between (1) and (2) at all.
> 
> To illustrate the problem I ran 2 workloads (hackbench/sleep) on a H960
> board with step-wise thermal governor tracing thermal load_avg
> (sched_pelt_thermal), thermal pressure (thermal_pressure_update) and CPU
> capacity (sched_cpu_capacity). Would we really gain something
> substantial reliably when we would know the diff between (1) and (2)?
> 
> https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/thermal_pressure.ipynb
> 

So what you're asking for is to switch to this?

	diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
	index b01854984994..989f1947bd34 100644
	--- a/kernel/sched/fair.c
	+++ b/kernel/sched/fair.c
	@@ -8862,7 +8862,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)

			rcu_read_lock();

	-               inv_cap = capacity_orig - thermal_load_avg(rq);
	+               inv_cap = capacity_orig - arch_scale_thermal_pressure(rq);
			pd = rcu_dereference(rq->rd->pd);
			rq->cpu_capacity_inverted = 0;

	@@ -8887,7 +8887,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
				 * inversion.
				 */
				if (capacity_orig == pd_cap_orig) {
	-                               pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
	+                               pd_cap = pd_cap_orig - arch_scale_thermal_pressure(cpu_rq(cpu));

					if (pd_cap > inv_cap) {
						rq->cpu_capacity_inverted = inv_cap;

My main worry is that rq->cpu_capacity which is updated in the same location
uses thermal_load_avg(). The consistency was important IMO. Besides I think we
need good certainty the inversion is there - we don't want to be oscillating.
Say the big core thermal pressure is increasing and it is entering capacity
inversion. If we don't use the average we'd be avoiding the cpu one tick, but
place something that drives freq high on it the next tick. This ping-pong could
end up not giving the big cores some breathing room to cool down and settle on
one state, no?

I think Lukasz patch [1] is very important in controlling this aspect. And it
might help make the code more consistent by enabling switching all users to
thermal_load_avg() if we can speed up its reaction time sufficiently.

That said; I don't have a bullet proof answer or a very strong opinion about
it. Either direction we take I think we'll have room for improvements. It
seemed the safer more sensible option to me at this stage.

[1] https://lore.kernel.org/lkml/20220429091245.12423-1-lukasz.luba@arm.com/


Thanks!

--
Qais Yousef

  reply	other threads:[~2022-11-20 21:30 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-04 14:36 [PATCH v2 0/9] Fix relationship between uclamp and fits_capacity() Qais Yousef
2022-08-04 14:36 ` [PATCH v2 1/9] sched/uclamp: Fix relationship between uclamp and migration margin Qais Yousef
2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Qais Yousef
2022-11-04 17:35   ` [PATCH v2 1/9] " Valentin Schneider
2022-11-05 19:24     ` Qais Yousef
2022-11-07 18:58       ` Valentin Schneider
2022-11-08 11:33         ` Qais Yousef
2022-11-09 10:33   ` Dietmar Eggemann
2022-08-04 14:36 ` [PATCH v2 2/9] sched/uclamp: Make task_fits_capacity() use util_fits_cpu() Qais Yousef
2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Qais Yousef
2022-08-04 14:36 ` [PATCH v2 3/9] sched/uclamp: Fix fits_capacity() check in feec() Qais Yousef
2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Qais Yousef
2022-08-04 14:36 ` [PATCH v2 4/9] sched/uclamp: Make select_idle_capacity() use util_fits_cpu() Qais Yousef
2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Qais Yousef
2022-08-04 14:36 ` [PATCH v2 5/9] sched/uclamp: Make asym_fits_capacity() " Qais Yousef
2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Qais Yousef
2022-08-04 14:36 ` [PATCH v2 6/9] sched/uclamp: Make cpu_overutilized() " Qais Yousef
2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Qais Yousef
2022-08-04 14:36 ` [PATCH v2 7/9] sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition Qais Yousef
2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Qais Yousef
2022-08-04 14:36 ` [PATCH v2 8/9] sched/fair: Detect capacity inversion Qais Yousef
2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Qais Yousef
2022-11-09 10:42   ` [PATCH v2 8/9] " Dietmar Eggemann
2022-11-12 19:35     ` Qais Yousef
2022-11-16 17:45       ` Dietmar Eggemann
2022-11-20 21:30         ` Qais Yousef [this message]
2022-08-04 14:36 ` [PATCH v2 9/9] sched/fair: Consider capacity inversion in util_fits_cpu() Qais Yousef
2022-10-28  6:42   ` [tip: sched/core] " tip-bot2 for Qais Yousef
2022-11-04 17:35   ` [PATCH v2 9/9] " Valentin Schneider
2022-11-05 20:41     ` Qais Yousef
2022-11-07 18:58       ` Valentin Schneider
2022-11-08 11:51         ` Qais Yousef
2022-11-09 10:43       ` Dietmar Eggemann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20221120213013.t67xisvqxmftri52@airbuntu \
    --to=qyousef@layalina.io \
    --cc=Jonathan.JMChen@mediatek.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=han.lin@mediatek.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lukasz.luba@arm.com \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=vincent.guittot@linaro.org \
    --cc=wvw@google.com \
    --cc=xuewen.yan94@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).