Linux-PM Archive on lore.kernel.org
 help / color / Atom feed
From: Douglas Raillard <douglas.raillard@arm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
	mingo@redhat.com, rjw@rjwysocki.net, viresh.kumar@linaro.org,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, qperret@qperret.net,
	patrick.bellasi@matbug.net, dh.han@samsung.com
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware
Date: Thu, 17 Oct 2019 15:23:04 +0100
Message-ID: <7edb1b73-54e7-5729-db5d-6b3b1b616064@arm.com> (raw)
In-Reply-To: <20191017095015.GI2311@hirez.programming.kicks-ass.net>

Hi Peter,

On 10/17/19 10:50 AM, Peter Zijlstra wrote:
> On Mon, Oct 14, 2019 at 04:50:24PM +0100, Douglas Raillard wrote:
> 
>> I posted some numbers based on a similar experiment on the v2 of that series that
>> are still applicable:
>>
>> TL;DR the rt-app negative slack is divided by 1.75 by this series, with an
>> increase of 3% in total energy consumption. There is a burst every 0.6s, and
>> the average power consumption increase is proportional to the average number
>> of bursts.
>>
>>
>> The workload is an rt-app task ramping up from 5% to 75% util in one big step,
>> pinned on a big core. The whole cycle is 0.6s long (0.3s at 5% followed by 0.3s at 75%).
>> This cycle is repeated 20 times and the average of boosting is taken.
>>
>> The test device is a Google Pixel 3 (Qcom Snapdragon 845) phone.
>> It has a lot more OPPs than a hikey 960, so gradations in boosting
>> are better reflected on frequency selection.
>>
>> avg slack (higher=better):
>>      Average time between task sleep and its next periodic activation.
>>      See rt-app doc: https://github.com/scheduler-tools/rt-app/blob/9a50d76f726d7c325c82ac8c7ed9ed70e1c97937/doc/tutorial.txt#L631
>>
>> avg negative slack (lower in absolute value=better):
>>      Same as avg slack, but only taking into account negative values.
>>      Negative slack means a task activation did not have enough time to complete before the next
>>      periodic activation fired, which is what we want to avoid.
>>
>> boost energy overhead (lower=better):
>>      Extra power consumption induced by ramp boost, assuming continuous OPP space (infinite number of OPP)
>>      and single-CPU policies. In practice, fixed number of OPP decrease this value, and more CPU per policy increases it,
>>      since boost(policy) = max(boost(cpu) foreach cpu of policy)).
>>
>> Without ramp boost:
>> +--------------------+--------------------+
>> |avg slack (us)      |avg negative slack  |
>> |                    |(us)                |
>> +--------------------+--------------------+
>> |6598.72             |-10217.13           |
>> |6595.49             |-10200.13           |
>> |6613.72             |-10401.06           |
>> |6600.29             |-9860.872           |
>> |6605.53             |-10057.64           |
>> |6612.05             |-10267.50           |
>> |6599.01             |-9939.60            |
>> |6593.79             |-9445.633           |
>> |6613.56             |-10276.75           |
>> |6595.44             |-9751.770           |
>> +--------------------+--------------------+
>> |average                                  |
>> +--------------------+--------------------+
>> |6602.76             |-10041.81           |
>> +--------------------+--------------------+
>>
>>
>> With ramp boost enabled:
>> +--------------------+--------------------+--------------------+
>> |boost energy        |avg slack (us)      |avg negative slack  |
>> |overhead (%)        |                    |(us)                |
>> +--------------------+--------------------+--------------------+
>> |3.05                |7148.93             |-5664.26            |
>> |3.04                |7144.69             |-5667.77            |
>> |3.05                |7149.05             |-5698.31            |
>> |2.97                |7126.71             |-6040.23            |
>> |3.02                |7140.28             |-5826.78            |
>> |3.03                |7135.11             |-5749.62            |
>> |3.05                |7140.24             |-5750.0             |
>> |3.05                |7144.84             |-5667.04            |
>> |3.07                |7157.30             |-5656.65            |
>> |3.06                |7154.65             |-5653.76            |
>> +--------------------+--------------------+--------------------+
>> |average                                                       |
>> +--------------------+--------------------+--------------------+
>> |3.039000            |7144.18             |-5737.44            |
>> +--------------------+--------------------+--------------------+
>>
>>
>> The negative slack is due to missed activations while the utilization signals
>> increase during the big utilization step. Ramp boost is designed to boost frequency during
>> that phase, which materializes in 1.75 less negative slack, for an extra power
>> consumption under 3%.
> 
> OK, so I think I see what it is doing, and why.
> 
> Normally we use (map_util_freq):
> 
> 	freq = C * max_freq * util / max ; C=1.25
> 
> But here, when util is increasing, we effectively increase our C to
> allow picking a higher OPP. Because of that higher OPP we finish our
> work sooner (avg slack increases)

Indeed. The increase in avg slack is much smaller than the absolute decrease in negative slack,
which means we don't boost when that's not necessary
(i.e. on activations where the slack is already positive).

> and miss our activation less often
> (avg neg slack decreases).

We actually miss the activation as often, but by a lesser amount of time,
which is desirable. Preventing from missing activations altogether could
probably be achieved by very aggressive boosting, at which point it might
be better/easier to just assume util=1024 and let it decrease to its
correct value. None of that sounds very appealing in the general case though.

> Now, the thing is, we use map_util_freq() in more places, and should we
> not reflect this increase in C for all of them? That is, why is this
> patch changing get_next_freq() and not map_util_freq().
> 
> I don't think that question is answered in the Changelogs.
> 
> Exactly because it does change the energy consumption (it must) should
> that not also be reflected in the EAS logic?

map_util_freq() is only used in schedutil and in EAS by compute_energy(), so
I retarget the question at compute_energy(). It is supposed to compute
the energy consumed by a given CPU if a given task was migrated on it.
This implies some assumptions on util signals:
1) util(_est.enqueued) of the task is representative of the task's duty cycle
   (the semantic of util is somehow blurry for aperiodic tasks AFAIK).
2) util of the task is CPU-invariant
3) task util + target CPU util = new target CPU util for the foreseeable future,
    i.e. the amount of future we can predict with reasonable accuracy. Otherwise
    we would end up moving things around all the time.

Since ramp boost is designed to be transient, integrating it (indirectly) in "target CPU util" will
add some noise to the placement decision, potentially rendering it obsolete as soon
as the boosting stops. Basing a costly decision on that does not sound like a good idea IMHO.

> I'm still thinking about the exact means you're using to raise C; that
> is, the 'util - util_est' as cost_margin. It hurts my brain still.

util_est is currently the best approximation of the actual portion of the CPU the task needs:
1) for periodic tasks, it's not too far from the duty cycle, and is always higher

2) for aperiodic tasks, it (indirectly) takes into account the total time it took
   to complete the previous activation, so the signal is not 100% composed of logical signals
   only relevant for periodic tasks (although it's a big part of it).

3) Point 1) and 2) together allows util_est to adapt to periodic tasks that changes
their duty cycle over time, without needing a very long history (the last task period
is sufficient).

For periodic tasks, the distance between instantaneous util_avg and the actual task
duty cycle indicates somehow what is our best guess of the (potential) change in the task
duty cycle.

util_est is the threshold (assuming util_avg increasing) for util_avg after which we know
for sure that even if the task stopped right now, its duty cycle would be higher than
during the previous period.
This means for a given task and with (util >= util_est):

1) util - util_est == 0 means the task duty cycle will be equal to the one during
   during the previous activation, if the tasks stopped executing right now.

2) util - util_est > 0 means the task duty cycle will be higher to the one during
   during the previous activation, if the tasks stopped executing right now.

Using the difference (util - util_est) will therefore give these properties to the boost signal:
* no boost will be applied as long as the task has a constant or decreasing duty cycle.

* when we can detect that the duty cycle increases, we temporarily increase the frequency.
   We start with a slight increase, and the longer we wait for the current period to finish,
   the more we boost, since the more likely it is that the task has a much larger duty cycle
   than anticipated. More specifically, the evaluation of "how much more" is done the exact
   same way as it is done for PELT, since the dynamic of the boost is "inherited" from PELT.
   
Now if the task is aperiodic, the boost will allow reaching the highest frequency faster,
which may or may not be desired. Ultimately, it's not more or less wrong than just picking
the freq based on util_est alone, since util_est is already somewhat meaningless for aperiodic
tasks. It just allows reaching the max freq at some point without waiting for too long, which is
all what we can do without more info on the task.

When applying these boosting rules on the runqueue util signals, we are able to detect if at least one
task needs boosting according to these rules. That only holds as long as the history we look at is
the result of a stable set of tasks, i.e. no tasks added or removed from the rq.


Regards,
Douglas

  parent reply index

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-11 13:44 Douglas RAILLARD
2019-10-11 13:44 ` [RFC PATCH v3 1/6] PM: Introduce em_pd_get_higher_freq() Douglas RAILLARD
2019-10-17  8:57   ` Dietmar Eggemann
2019-10-17  9:58   ` Dietmar Eggemann
2019-10-17 11:09     ` Douglas Raillard
2019-10-11 13:44 ` [RFC PATCH v3 2/6] sched/cpufreq: Attach perf domain to sugov policy Douglas RAILLARD
2019-10-17  8:57   ` Dietmar Eggemann
2019-10-17 10:22     ` Douglas Raillard
2019-10-11 13:44 ` [RFC PATCH v3 3/6] sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq() Douglas RAILLARD
2019-10-11 13:44 ` [RFC PATCH v3 4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost Douglas RAILLARD
2019-10-14 14:33   ` Peter Zijlstra
2019-10-14 15:32     ` Douglas Raillard
2019-10-17  8:57   ` Dietmar Eggemann
2019-10-17 11:19     ` Douglas Raillard
2019-10-11 13:44 ` [RFC PATCH v3 5/6] sched/cpufreq: Boost schedutil frequency ramp up Douglas RAILLARD
2019-10-17  9:21   ` Dietmar Eggemann
2019-10-11 13:45 ` [RFC PATCH v3 6/6] sched/cpufreq: Add schedutil_em_tp tracepoint Douglas RAILLARD
2019-10-14 14:53 ` [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware Peter Zijlstra
2019-10-14 15:50   ` Douglas Raillard
2019-10-17  9:50     ` Peter Zijlstra
2019-10-17 11:11       ` Quentin Perret
2019-10-17 14:11         ` Peter Zijlstra
2019-10-18  7:44           ` Dietmar Eggemann
2019-10-18  7:59             ` Peter Zijlstra
2019-10-18 17:24               ` Douglas Raillard
2019-10-18  8:11             ` Peter Zijlstra
2019-10-17 14:23       ` Douglas Raillard [this message]
2019-10-17 14:53         ` Peter Zijlstra
2019-10-17 19:07         ` Peter Zijlstra
2019-10-18 11:46           ` Douglas Raillard
2019-10-18 12:07             ` Peter Zijlstra
2019-10-18 14:44               ` Douglas Raillard
2019-10-18 15:15                 ` Vincent Guittot
2019-10-18 16:03                   ` Douglas Raillard
2019-10-18 15:20                 ` Vincent Guittot

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7edb1b73-54e7-5729-db5d-6b3b1b616064@arm.com \
    --to=douglas.raillard@arm.com \
    --cc=dh.han@samsung.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=patrick.bellasi@matbug.net \
    --cc=peterz@infradead.org \
    --cc=qperret@qperret.net \
    --cc=rjw@rjwysocki.net \
    --cc=vincent.guittot@linaro.org \
    --cc=viresh.kumar@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-PM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-pm/0 linux-pm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-pm linux-pm/ https://lore.kernel.org/linux-pm \
		linux-pm@vger.kernel.org
	public-inbox-index linux-pm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-pm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git