"Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

> On Thursday, July 30, 2020 2:49:34 AM CEST Francisco Jerez wrote:
>> 
>
> [cut]
>
>> >> >
>> >> >> >> No, I explicitly dismissed that in my previous reply.
>> >> >> >
>> >> >> > But at the same time you seem to agree that without the non-CPU component
>> >> >> > (or thermal pressure) the existing CPU performance scaling would be
>> >> >> > sufficient.
>> >> >> >
>> >> >>
>> >> >> Yes, but not necessarily in order to allow the non-CPU component to draw
>> >> >> more power as you said above, but also because the existence of a
>> >> >> bottleneck in a non-CPU component gives us an opportunity to improve the
>> >> >> energy efficiency of the CPU, regardless of whether that allows the
>> >> >> workload to run faster.
>> >> >
>> >> > But why would the bottleneck be there otherwise?
>> >> >
>> >>
>> >> Because some resource of the system (e.g. memory bandwidth, GPU fill
>> >> rate) may be close to 100% utilized, causing a bottleneck for reasons
>> >> unrelated to its energy usage.
>> >
>> > Well, not quite.  Or at least in that case the performance cannot be improved
>> > by limiting the CPU frequency below the frequency looked for by scaling
>> > governors, AFAICS.
>> >
>> 
>> Right, but it might still be possible to improve the energy efficiency
>> of the workload even if its throughput cannot be improved, which seems
>> like a worthwhile purpose in itself.
>
> My point is that in this case the energy-efficiency of the processor cannot
> be improved without decreasing performance.
>

Well, in principle it can whenever there is a bottleneck in a non-CPU
component.

> For the processors that are relevant here, the most energy-efficient way to
> run them is in the minimum P-state, but that rarely provides the required
> performance.  Without the knowledge on how much performance really is required
> we assume maximum achievable.  Anything else would need to involve some extra
> policy knobs (which are missing ATM) and that is another problem.
>

But there is a middle ground between limiting the workload to run at the
minimum P-state and having it run at the maximum achievable P-state.
The MEF estimation we were just talking about can help.  However, due to
its heuristic nature I certainly see the merit of having policy knobs
allowing the optimization to be controlled by users, which is why I
added such controls in v2.99, at your request.

> I'm not saying that it is not a problem, but it is not possible to say how much
> performance to sacrifice without any input from the user on that.
>
> IOW, this is a topic for another discussion.
>

I have the vague recollection that you brought that up already and I
agreed and implemented the changes you asked for.

>> > Scaling governors generally look for the maximum frequency at which there is no
>> > CPU idle time in the workload.  At that frequency the CPU time required by the
>> > workload to achieve the maximum performance is equal to the total CPU time
>> > available to it.  I will refer to that frequency as the maximum effective
>> > frequency (MEF) of the workload.
>> >
>> > By definition, running at frequencies above the MEF does not improve
>> > performance, but it causes CPU idle time to appear.  OTOH running at
>> > frequencies below the MEF increases the CPU time required by the workload
>> > to achieve the maximum performance, so effectively the workload does
>> > not get enough CPU time for the performance to be maximum, so it is lower
>> > than at the MEF.
>> >
>> 
>> Yes, we agree on that.
>> 
>> > Of course, the MEF is well-defined as long as the processor does not share
>> > the power budget with another component that is also used by the workload
>> > (say, a GPU).  Without the sharing of a power budget, the MEF can be determined
>> > by looking at the CPU idle time (or CPU busy time, or CPU load, whichever is
>> > the most convenient) alone, because it already depends on the speed of any
>> > memory etc accessed by the workload and slowing down the processor doesn't
>> > improve the performance (because the other components don't run any faster
>> > as a result of that).
>> >
>> > However, if the processor is sharing the power budget with a GPU (say), it
>> > is not enough to look at the CPU idle time to determine the MEF, because
>> > slowing down the processor generally causes the GPU to get more power which
>> > allows it to run faster and CPUs can do more work, because they spend less
>> > time waiting for the GPU, so the CPU time available to the workload effectively
>> > increases and it can achieve the maximum performance at a lower frequency.
>> > So there is an "effective MEF" depending on the relative performance balance
>> > between the processor and the GPU and on what "fraction" of the workload
>> > runs on the GPU.
>> >
>> 
>> That doesn't mean that the MEF isn't well-defined in systems with a
>> shared power budget.  If you define it as the lowest frequency at which
>> the workload reaches maximum throughput, then there still is one even if
>> the system is TDP-bound: the maximum frequency at which the CPU and
>> device maintain their optimal power balance -- Above it the power budget
>> of the device will be constrained, causing it to run at less than
>> optimal throughput, also causing the workload as a whole to run at less
>> than optimal throughput.
>
> That is what I am calling the "effective MEF" and the main point is that it
> cannot be determined by looking at the CPU idle time alone.
>

And my main point is that it can (by looking at its average frequency
alone -- closely related but not fully equivalent to CPU idle time) *if*
the workload has reached a steady state close enough to its optimal
power balance.  Then I tried to explain how the governor I implemented
approaches such an optimal steady state from any arbitrary sub-optimal
state *without* relying on monitoring power consumption via RAPL.

>> That said I agree with you that it takes more than looking at the CPU
>> utilization in order to determine the MEF in a system with a shared
>> power budget -- Until you've reached a steady state close enough to the
>> optimal power balance, at which point looking at the CPU utilization is
>> really all it takes in order for the governor to estimate the MEF.
>
> I'm not convinced, because in principle there may be many steady states
> at "boundary" frequencies, such that the CPU idle time goes from zero to
> nonzero when crossing the "boundary", in that case.
>
> That generally depends on what OPPs are available (physically) to the
> processor and the GPU (I'm using the GPU as an example, but that may be
> something else sharing the power budget with the processor, like an FPGA).
>
> For example, if increasing the CPU frequency above a "boundary" does not
> cause it to take up enough of the power budget to force the GPU to switch
> over to a lower-frequency OPP, it may very well cause some CPU idle time
> to appear, but that doesn't mean that the optimum power balance has been
> reached.
>

But you do agree that, under the assumption (1) of steady state, and the
assumption (2) that the processor has already crossed the right
boundaries for the power balance to be close to optimal, it is
sufficient to look at its average frequency ("average" defined with a
time parameter consistent with the latency constraints of the workload)
in order to estimate its MEF.  Or?

Due to the optimality assumption (2), the computational power that the
workload can extract out of the bottlenecking device is maximal, which
means that the amount of time the CPU spends waiting for the device per
unit of CPU work is minimal, which means that even if we run the CPU at
a higher frequency it won't take less time per unit of CPU work, which
means that we have reached the MEF.  Due to the steady state assumption
(1) we can then extrapolate the MEF estimate computed for the immediate
past into the immediate future, which is the heuristic part of this.
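
To make the extrapolation step concrete, here is a minimal sketch, not
the governor from the patches.  It assumes the MEF estimate is a simple
exponential moving average of average-frequency samples, with a time
constant standing in for the latency constraint of the workload, and
that the requested frequency overshoots the estimate by a fixed factor.
The class name, the parameter names and the 10% overshoot are all
hypothetical.

```python
import math


class MefEstimator:
    """Illustrative only: low-pass filter over average-frequency samples."""

    def __init__(self, time_constant_s, overshoot=0.1):
        self.time_constant_s = time_constant_s  # hypothetical latency-derived window
        self.overshoot = overshoot              # hypothetical headroom factor
        self.estimate_khz = None

    def update(self, avg_freq_khz, dt_s):
        """Fold one average-frequency sample covering dt_s seconds into the estimate."""
        if self.estimate_khz is None:
            self.estimate_khz = float(avg_freq_khz)
        else:
            alpha = 1.0 - math.exp(-dt_s / self.time_constant_s)
            self.estimate_khz += alpha * (avg_freq_khz - self.estimate_khz)
        return self.estimate_khz

    def target_khz(self):
        """Frequency to request: the estimate plus headroom to track load increases."""
        return self.estimate_khz * (1.0 + self.overshoot)


est = MefEstimator(time_constant_s=0.05)
for _ in range(20):
    est.update(2_000_000, dt_s=0.01)   # steady state: the samples do not move
steady_target = est.target_khz()
est.update(3_000_000, dt_s=0.01)       # load rises: estimate moves toward the new sample
```

In steady state the estimate stays put and the request sits one
overshoot factor above it; when the observed average frequency rises,
the estimate follows within a few samples.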

Then it remains to show that the controller will approach such an
optimal steady state even if assumption (2) doesn't hold initially,
which is what I tried to do in my previous e-mail at a high level.  Let
me know if my explanation wasn't clear enough.

> In general, the GPU needs to be monitored as well as the CPU and that's
> why the GPU bottleneck detection in your patches is key.  But having
> that in place one could simply put an upper limit on the CPU frequency
> through the existing policy max QoS in the cpufreq framework in response
> to the GPU bottleneck without changing the scaling governors.
>

Yes, at the cost of monitoring the average frequency of every CPU from
every IO device driver that can potentially benefit from improved CPU
energy efficiency, in order for each of them to compute an appropriate
MEF estimate for each CPU.  And at the cost of performance degradation
in a multitasking environment whenever two or more different processes
impose conflicting frequency QoS constraints based on conflicting
latency requirements -- Or in a multitasking environment where a certain
process needs to be excluded from the frequency constraint, which you
were advocating for earlier in this thread.
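
The conflict can be illustrated with a small sketch.  It assumes that
upper-limit frequency requests from different requesters are combined by
taking the most restrictive one, which is my understanding of how
max-frequency QoS constraints aggregate.  The requester names and the
frequencies are made up for illustration.

```python
def effective_max_freq_khz(requests_khz, hardware_max_khz):
    """Combine upper-limit requests: the most restrictive one wins."""
    return min([hardware_max_khz, *requests_khz.values()])


# Hypothetical requesters with different ideas of an acceptable CPU frequency cap.
requests = {"gpu_driver": 1_600_000, "storage_driver": 2_400_000}
limit = effective_max_freq_khz(requests, hardware_max_khz=3_200_000)
no_requests_limit = effective_max_freq_khz({}, hardware_max_khz=3_200_000)
```

Under that assumption the requester that needed 2.4 GHz to meet its
latency target ends up capped at 1.6 GHz, because the cap applies to the
CPU as a whole rather than to the task that motivated it.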

>> IOW, introducing additional power budget variables (in combination with
>> additional power curve information from both the CPU and device) *might*
>> help you reach the optimal steady state from a suboptimal state more
>> quickly in principle, but it won't have any effect on the optimality of
>> the final steady state as soon as it's reached.
>> 
>> Does that mean it's essential to introduce such power variables in order
>> for the controller to approach the optimal steady state?  No, because
>> any realistic controller attempting to approximate the MEF of the
>> workload (whether it's monitoring the power consumption variables or
>> not) necessarily needs to overshoot that MEF estimate by some factor in
>> order to avoid getting stuck at a low frequency whenever the load
>> fluctuates above the current MEF.  This means that even if at some point
>> the power balance is far enough from the optimal ratio that the initial
>> MEF estimate is off by a fraction greater than the overshoot factor of
>> the controller, the controller will get immediate feedback of the
>> situation as soon as the device throughput ramps up due to the released
>> power budget, allowing it to make a more accurate approximation of the
>> real MEF in a small number of iterations (of the order of 1 iteration
>> surprisingly frequently).
>> 
>> > This means that the power budget sharing is essential here and the "if the
>> > energy-efficiency of the processor is improved, the other components get
>> > more power as a bonus" argument is not really valid.
>> >
>> 
>> That was just a statement of my goals while working on the algorithm
>> [improve the energy efficiency of the CPU in presence of an IO
>> bottleneck], it's therefore axiomatic in nature rather than some sort of
>> logical conclusion that can be dismissed as invalid.  You might say you
>> have different goals in mind but that doesn't mean other people's are
>> not valid.
>
> Well, this really isn't about the goals but about understanding of what
> really happens.
>
> What I'm trying to say is that the sharing of energy budget is a necessary
> condition allowing the processor's energy-efficiency to be improved without
> sacrificing performance.
>

And I disagree it's a necessary condition.  As a counterexample consider
a video game being rendered at 60 FPS on a discrete GPU without energy
budget sharing.  Suppose that only 40% of the CPU computational capacity
is needed in order to achieve that, but the CPU frequency peaks at 80%
of its maximum.  Then assume that we clamp the CPU frequency at 40% of
its maximum, still within the convexity range of its power curve.  While
doing that we have improved the processor's energy efficiency without
sacrificing performance.  And there was no sharing of energy budget
whatsoever.
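
To put rough numbers on this counterexample, here is a minimal sketch
under a deliberately crude and hypothetical power model: active power is
a static term plus a term cubic in frequency, idle power is zero, and
the CPU runs at exactly 80% while busy instead of merely peaking there.
None of the coefficients come from real hardware.

```python
def active_power(freq_frac, dynamic_coeff=1.0, static_power=0.1):
    """Hypothetical convex power curve: static term plus cubic dynamic term."""
    return static_power + dynamic_coeff * freq_frac ** 3


def average_power(freq_frac, required_capacity, idle_power=0.0):
    """Average power when a fixed amount of work per second runs at freq_frac."""
    busy = required_capacity / freq_frac  # fraction of time the CPU is busy
    if busy > 1.0:
        raise ValueError("frequency too low to sustain the workload")
    return busy * active_power(freq_frac) + (1.0 - busy) * idle_power


# The game needs 40% of the CPU capacity to hold 60 FPS in both cases.
p_at_80 = average_power(0.8, required_capacity=0.4)  # busy half of the time
p_at_40 = average_power(0.4, required_capacity=0.4)  # busy all of the time
```

With these made-up coefficients the average power drops from about 0.31
to about 0.16 units at an unchanged frame rate.  The exact ratio depends
entirely on the assumed curve; the point is only that it holds without
any shared power budget.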

>> > The frequency of the processor gets limited in order for the other components
>> > to get more power, which then allows the processor to do more work in the
>> > same time at the same frequency, so it becomes more energy-efficient.
>> 
>> Whenever the throughput of the workload is limited by its power budget,
>> then yes, sure, but even when that's not the case it can be valuable to
>> reduce the amount of energy the system is consuming in order to perform
>> a certain task.
>
> Yes, it is valuable, but this is a separate problem and addressing it
> requires taking additional user input (regarding the energy vs performance
> policy) into account.

Yes, I agree that taking user input is valuable.  Feel free to provide
any feedback on that matter if you don't consider the current policy
mechanism to be satisfactory.