From: Qais Yousef <qyousef@layalina.io>
To: Lukasz Luba <lukasz.luba@arm.com>
Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
Date: Sun, 10 Sep 2023 19:14:19 +0100
Message-ID: <20230910181419.ljejml3qazom2jgt@airbuntu>
In-Reply-To: <6011d8bb-9a3b-1435-30b0-d75b39bf5efa@arm.com>

On 09/07/23 08:48, Lukasz Luba wrote:

> They are periodic in a sense, they wake up every 16ms, but sometimes
> they have more work. It depends on what is currently going on in the game
> and/or sometimes on data locality (might not be in the cache).
> 
> Although that's for games; other workloads like YouTube playback or this
> 'Yahoo browser' one (from your example) are more 'predictable' (after the
> start up period). And I really like the potential energy saving
> there :)

It is more complicated than that from what I've seen. Userspace is sadly
bloated and the relationships between the tasks are a lot more complex. They
talk to other framework elements and other hardware, have network traffic
coming in, and, specifically for gaming, could be preparing multiple frames
in parallel. The task wake up and sleep times are not that periodic. A task
can busy loop for periods of time, or wake up for short periods whose pattern
is not exact because it interacts with other elements in a serial manner: one
task prepares something, taking a variable amount of time each wake up, before
handing it over to the next task.

Browsers can be tricky as well: the work depends on what elements appear as
the user scrolls, what JavaScript executes and how heavy it is, how many tabs
have webpages open, and how the user moves between them.

It is organized chaos :-)

> 
> > 
> > I think the model of a periodic task is not suitable for most workloads. All
> > of them are dynamic and how much they need to do at each wake up can vary
> > significantly over 10s of ms.
> 
> Might be true; the model was built a few years ago when there weren't such
> dynamic game scenarios with high FPS on mobiles. This could still be tuned
> with your new design IIUC (no need for extra hooks in Android).

It is my perception of course. But I think that generally, not just for
gaming, there are a lot of elements interacting with each other in complex
ways. The wake up times and lengths are determined by these complex elements;
and it is a very dynamic interaction where they could get into a steady state
for a very short period of time but then change quickly. As an extreme
example, a player could be standing in an empty room doing nothing, but
another player in another part of the world launches a rocket at this room,
and we only get to know that we have to draw a big explosion when the network
packet arrives.

A lot of workloads are interactive, and these moments of interaction are the
challenging ones.

> 
> > 
> > > 2. Plumb in this new idea of dvfs_update_delay as the new
> > >     'margin' - this I don't understand
> > > 
> > > Regarding 2., I don't see that the DVFS HW characteristics are the best
> > > fit for this problem. We can have really fast DVFS HW, but we still need
> > > some decent spare idle time in some workloads; these are two independent
> > > issues IMO. You might get the higher idle time thanks to 1.1., but this
> > > is a 'side effect'.
> > > 
> > > Could you explain a bit more why this dvfs_update_delay is
> > > crucial here?
> > 
> > I'm not sure why you relate this to idle time. And the word margin is a bit
> > overloaded here, so I suppose you're referring to the one we have in
> > map_util_perf() or apply_dvfs_headroom(). And I suppose you assume this extra
> > headroom will result in idle time, but this is not necessarily true IMO.
> > 
> > My rationale is simply that DVFS based on util should follow util_avg as-is.
> > But as pointed out in different discussions that happened elsewhere, we need
> > to provide headroom for this util to grow: if we were to be exact and the
> > task continues to run, then the util will likely go above the current OPP
> > before we get a chance to change it again. If we do have an ideal hardware
> > that takes
> 
> Yes, this is another requirement for having a +X% margin. When the tasks are
> growing, we don't know their final util_avg, so we give them a few more
> cycles.
> IMO we always have to be ready for such a situation in the scheduler,
> don't we?

Yes, we should, and I don't think I am ignoring this part. Hope I clarified
things offline.
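
To make this concrete, the headroom we keep referring to boils down to
roughly the below. This is a simplified sketch with an illustrative helper
name, not verbatim kernel code:

	/*
	 * Request capacity for ~1.25x the current util so that util has
	 * room to grow before the next frequency update can take effect.
	 * This is effectively what map_util_perf()/apply_dvfs_headroom()
	 * amount to today.
	 */
	static inline unsigned long dvfs_headroom(unsigned long util)
	{
		return util + (util >> 2);	/* util * 1.25 */
	}

	/*
	 * schedutil then maps that to a frequency request, roughly:
	 *
	 *	next_freq = 1.25 * max_freq * util / max_capacity
	 *
	 * Back-of-the-envelope (ignoring PELT's divider details): with
	 * the 32ms PELT half-life, a task at util ~512 that keeps running
	 * climbs to roughly 610 after 10ms. The question in this thread
	 * is whether a fixed 25% is the right amount of room to give util
	 * to grow within one DVFS update delay.
	 */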


Cheers

--
Qais Yousef

