From: Daniel Lezcano <daniel.lezcano@linaro.org>
To: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Qais Yousef <qais.yousef@arm.com>,
	Valentin Schneider <valentin.schneider@arm.com>
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even
Date: Tue, 17 Mar 2020 18:07:43 +0100	[thread overview]
Message-ID: <7cd04d35-3522-30fb-82e9-82fdf53d0957@linaro.org> (raw)
In-Reply-To: <20200317143053.GF10914@e105550-lin.cambridge.arm.com>

On 17/03/2020 15:30, Morten Rasmussen wrote:
> On Tue, Mar 17, 2020 at 02:48:51PM +0100, Daniel Lezcano wrote:
>>
>> Hi Morten,
>>
>> On 17/03/2020 08:56, Morten Rasmussen wrote:
>>> Hi Daniel,
>>>
>>> First, I think letting the scheduler know about desired minimum idle
>>> times is an interesting optimization if the overhead can be kept at a
>>> minimum. I do have a few comments about the patch though.
>>>
>>> On Thu, Mar 12, 2020 at 11:04:19AM +0100, Daniel Lezcano wrote:
>>>> On 12/03/2020 09:36, Vincent Guittot wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> On Wed, 11 Mar 2020 at 21:28, Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
>>>>>>
>>>>>> In the idle CPU selection process occurring in the slow path via the
>>>>>> find_idlest_group_cpu() function, we preferentially pick an idle CPU
>>>>>> in the shallowest idle state; otherwise we fall back to the least
>>>>>> loaded CPU.
>>>>>
>>>>> The idea makes sense, but this path is only used by fork and exec, so
>>>>> I'm not sure about the real impact.
>>>>
>>>> I agree the fork / exec path is called much less often than the wake-up
>>>> path, but it makes more sense for this kind of decision.
>>>
>>> Looking at the flow in find_idlest_cpu(), AFAICT,
>>> find_idlest_group_cpu() is not actually making the final choice of CPU,
>>> so going through a lot of trouble there looking at idle states is
>>> pointless. Is there something I don't see?
>>>
>>> We follow sd->child until groups == CPUs, which means that
>>> find_idlest_group() actually makes the final choice, as the final group
>>> passed to find_idlest_group_cpu() is a single-CPU group. The flow has been
>>> like that for years, even before you added the initial idle-state
>>> awareness.
>>>
>>> I agree with Vincent: if this should really make a difference, it should
>>> include wake-ups of existing tasks too, although I'm aware that would be a
>>> more invasive change. As said from the beginning, the idea is fine, but
>>> the current implementation should not make any measurable difference?
>>
>> The wake-up path is so sensitive that I'm not comfortable making any
>> changes to it. That is the reason why the patch only touches the slow path.
> 
> Right. I'm not against being cautious at all. It would be interesting to
> evaluate how bad it really is. The extra cost of the time-stamping business
> is the same either way, so it really comes down to how much we dare to use
> the information in the fast path to change the CPU selection policy. And of
> course, how much can be gained by the change.

If the change provided by this patch is acceptable, I can give the fast
path a try.
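
For reference, the change to the slow path is conceptually something like
the sketch below. It is simplified and not the literal patch: the
rq->break_even field is what the patch introduces, and the loop only
roughly mirrors the existing find_idlest_group_cpu() logic.

/*
 * Simplified sketch: among idle CPUs we still prefer the shallowest idle
 * state, and among equally shallow CPUs we now prefer the one whose
 * break-even deadline is earliest, i.e. the one most likely to have
 * already amortized its idle-state entry/exit cost.
 */
static int pick_idle_cpu(const struct cpumask *cpus)
{
	unsigned int min_exit_latency = UINT_MAX;
	u64 min_break_even = U64_MAX;
	int best_cpu = -1, cpu;

	for_each_cpu(cpu, cpus) {
		struct rq *rq = cpu_rq(cpu);
		struct cpuidle_state *idle = idle_get_state(rq);

		if (!idle)
			continue;	/* busy CPUs are handled by the load fallback */

		if (idle->exit_latency < min_exit_latency ||
		    (idle->exit_latency == min_exit_latency &&
		     rq->break_even < min_break_even)) {
			min_exit_latency = idle->exit_latency;
			min_break_even = rq->break_even;
			best_cpu = cpu;
		}
	}

	return best_cpu;
}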

>>>>>> In order to be more energy efficient without impacting performance,
>>>>>> let's use another criterion: the break-even deadline.
>>>>>>
>>>>>> At idle time, when we store the idle state the CPU is entering, we
>>>>>> also compute the break-even deadline: the earliest time the CPU could
>>>>>> be woken up without its stay in the idle state having cost more energy
>>>>>> than it saved.
>>>
>>> I don't follow the argument that sleeping longer should improve energy
>>> consumption. 
>>
>> Maybe it is not explained correctly.
>>
>> The patch is about selecting the CPU with the smallest break-even deadline
>> value. In a group of idle CPUs in the same idle state, we will pick the
>> one with the smallest break-even deadline, which is the one with the
>> highest probability of having already reached its target residency.
>>
>> It is best effort.
> 
> Indeed. I get what the patch does, I just don't see how the patch
> improves energy efficiency.

If the CPU is woken up before it reaches its break even, the idle state
costs more energy than it saved.

Am I misunderstanding your point?
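
To put it another way, what gets stored at idle entry is essentially the
point in time where the state's energy cost is amortized, along the lines
of the sketch below. This is illustrative only: the function name is made
up, the exact computation in the patch differs, and rq->break_even is the
field the patch introduces.

/*
 * Illustrative sketch: the break-even deadline is the idle entry time
 * plus the time the state needs to amortize its entry/exit energy cost.
 * A CPU woken up before rq->break_even has spent more energy entering
 * and exiting the state than it saved by sleeping.
 */
static void set_break_even(struct rq *rq, struct cpuidle_state *state)
{
	u64 now = sched_clock();

	/* target_residency is in microseconds in struct cpuidle_state */
	rq->break_even = now + state->target_residency * NSEC_PER_USEC;
}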

>>> The patch doesn't affect the number of idle state
>>> enter/exit cycles, so you spend the same amount of energy on those
>>> transitions. The main change is that idle time gets spread out, so CPUs
>>> are less likely to be in the process of entering an idle state when they
>>> are asked to wake back up again.
>>>
>>> Isn't it fair to say that we expect the total number of wake-ups to remain
>>> unchanged? Total busy and idle times across all CPUs should remain the
>>> same too? Unless the chosen idle state is changed, which I don't think we
>>> expect either, there should be no net effect on energy? The main benefit
>>> is reduced wake-up latency, I think.
>>>
>>> Regarding the chosen idle state, I'm wondering how this patch affects the
>>> cpuidle governor's idle state selection. Could the spreading of wake-ups
>>> trick the governor into picking a shallower idle state for some idle CPUs
>>> because we actively spread wake-ups rather than consolidating them? Just a
>>> thought.
>>
>> Maybe I missed the point: why are we spreading the tasks?
> 
> Picking the CPU with the smallest break-even time-stamp means you pick
> the CPU that has been idle longest in the shallowest idle state. If you
> periodically spawn one-shot tasks at a period long enough that the
> shallowest state is the same for several CPUs, you end up picking the
> least recently used CPU each time, effectively spreading the wake-ups
> across all the CPUs in the same state.

Ok, thanks for the clarification.

> Thinking more about it, it might not be a real problem: if one of the
> CPUs suddenly chooses a shallower idle state, it would become the target
> of all new tasks from that point onwards.

Right.

>> We are taking the decision on the same sched domain, no?
> 
> I'm not sure I get the relation to the sched_domain?

Never mind. I misunderstood your comment.

>>>>>> In the selection process, we still prefer the shallowest idle state,
>>>>>> but in addition we choose the CPU with the minimal break-even deadline
>>>>>> instead of relying on the idle_timestamp. When the CPU is idle, the
>>>>>> timestamp has less meaning because the CPU could have woken up and gone
>>>>>> back to sleep several times without exiting the idle loop. In this case
>>>>>> the break-even deadline is more relevant, as it increases the probability
>>>>>> of choosing a CPU which has reached its break even.
>>>
>>> I guess you could improve the idle time-stamping without adding the
>>> break-even time; they don't have to go together?
>>
>> Yes, we can record the idle start time when entering idle in the
>> cpuidle_enter() function, which is different from the idle_timestamp that
>> records when the idle task was scheduled in. I sent an RFC for that [1].
>>
>> However, each time we want to inspect the deadline, we would have
>> to compute it, so IMO it makes more sense to pre-compute it when
>> entering idle, in addition to the idle start time.
>>
>> [1] https://lkml.org/lkml/2020/3/16/902
> 
> Yes, I saw that patch too. Seems to make sense :-)
> 
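
For completeness, the idea of that RFC is simply to time-stamp at the
cpuidle entry point, so the value stays meaningful even when the CPU
re-enters idle without leaving the idle loop; conceptually something like
the sketch below. It is not the RFC's actual code, the per-cpu variable
name is made up, and the real cpuidle_enter() does more than this.

/*
 * Sketch only: record the idle entry time where cpuidle actually enters
 * the hardware state, instead of relying on rq->idle_stamp, which is only
 * updated when the idle task is scheduled in.
 */
static DEFINE_PER_CPU(u64, idle_entry_time);	/* illustrative name */

int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev,
		  int index)
{
	__this_cpu_write(idle_entry_time, local_clock());

	return cpuidle_enter_state(dev, drv, index);
}
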
>>
>>>>>> Tested on:
>>>>>>  - a synquacer, 24 cores, 6 sched domains
>>>>>>  - a hikey960, HMP 8 cores, 2 sched domains, with EAS and the energy probe
>>>>>>
>>>>>> sched/perf and messaging do not show a performance regression. Ran
>>>>>> schbench, adrestia and forkbench 50 times each.
>>>>>>
>>>>>> The tools described at https://lwn.net/Articles/724935/
>>>>>>
>>>>>>  --------------------------------------------------------------
>>>>>> | Synquacer             | With break even | Without break even |
>>>>>>  --------------------------------------------------------------
>>>>>> | schbench *99.0th      |      14844.8    |         15017.6    |
>>>>>> | adrestia / periodic   |        57.95    |              57    |
>>>>>> | adrestia / single     |         49.3    |            55.4    |
>>>>>>  --------------------------------------------------------------
>>>>>
>>>>> Have you got some figures or cpuidle statistics for the synquacer?
>>>>
>>>> No, and we just noticed the synquacer has a bug in the firmware and
>>>> does not actually enter the idle states.
>>>
>>> I would also like some statistics to help understand what actually
>>> changes.
>>>
>>> I did some measurements on TX2, which only has one idle state. I don't
>>> see the same trends as you do. adrestia single seems to be the most
>>> affected by the patch, but _increases_ with the break_even patch rather
>>> than decreasing. I don't trust adrestia too much though, as the time
>>> resolution is low on TX2.
>>>
>>> TX2			tip		break_even
>>> ----------------------------------------------------
>>> adrestia / single	5.21		5.51
>>> adrestia / periodic	5.75		5.67
>>> schbench 99.0th		45465.6		45376.0
>>> hackbench		27.9851		27.9775
>>>
>>> Notes:
>>> adrestia: Avg of 100 runs: adrestia -l 25000
>>> schbench: Avg of 10 runs: schbench -m16 -t64
>>> hackbench: Avg of 10 runs: hackbench -g 20 -T 256 -l 100000
>>
>> Thanks for testing. Is that a Jetson TX2 from Nvidia? If that is the
>> case, IIRC, it has some kind of switcher for the CPUs in the firmware; I
>> don't know how that might interact with the testing.
> 
> Sorry, I should have been clearer. It is a ThunderX2. 2x 32-core (128
> threads) = 256 HW threads.

Indeed, that is different :)

I have a synquacer, but we found a bug in the firmware; hopefully it
can be fixed.



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


