From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: John Harrison <john.c.harrison@intel.com>,
	Intel-GFX@Lists.FreeDesktop.Org
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH 2/3] drm/i915/gt: Make the heartbeat play nice with long pre-emption timeouts
Date: Wed, 2 Mar 2022 11:07:14 +0000	[thread overview]
Message-ID: <42b114a7-04ea-71eb-7cd6-507fb4fb1655@linux.intel.com> (raw)
In-Reply-To: <8fb0a3a7-7968-79cb-9ea1-e31b0593acaa@intel.com>


On 01/03/2022 20:59, John Harrison wrote:
> On 3/1/2022 04:09, Tvrtko Ursulin wrote:
>>
>> I'll trim it a bit again..
>>
>> On 28/02/2022 18:55, John Harrison wrote:
>>> On 2/28/2022 09:12, Tvrtko Ursulin wrote:
>>>> On 25/02/2022 18:48, John Harrison wrote:
>>>>> On 2/25/2022 10:14, Tvrtko Ursulin wrote:
>>
>> [snip]
>>
>>>>>> Your only objection is that it ends up with too long a total time 
>>>>>> before reset? Or something else as well?
>>>>> An unnecessarily long total heartbeat timeout is the main 
>>>>> objection. (2.5 + 12) * 5 = 72.5 seconds. That is a massive change 
>>>>> from the current 12.5s.
>>>>>
>>>>> If we are happy with that huge increase then fine. But I'm pretty 
>>>>> sure you are going to get a lot more bug reports about hung systems 
>>>>> not recovering. 10-20s is just about long enough for someone to 
>>>>> wait before leaning on the power button of their machine. Over a 
>>>>> minute is not. That kind of delay is going to cause support issues.
>>>>
>>>> Sorry I wrote 12s, while you actually said tP * 12, so 7.68s, chosen 
>>>> just so it is longer than tH * 3?
>>>>
>>>> And how do you keep coming up with factor of five? Isn't it four 
>>>> periods before "heartbeat stopped"? (Prio normal, hearbeat, barrier 
>>>> and then reset.)
>>> Prio starts at low not normal.
>>
>> Right, slipped my mind since I only keep seeing that one priority 
>> ladder block in intel_engine_heartbeat.c/heartbeat()..
>>
>>>> From the point of view of user experience I agree reasonable 
>>>> responsiveness is needed before user "reaches for the power button".
>>>>
>>>> In your proposal we are talking about 3 * 2.5s + 2 * 7.5s, so 22.5s.
>>>>
>>>> Question of workloads.. what is the actual preempt timeout compute 
>>>> is happy with? And I don't mean compute setups with disabled 
>>>> hangcheck, which you say they want anyway, but if we target defaults 
>>>> for end users. Do we have some numbers on what they are likely to run?
>>> Not that I have ever seen. This is all just finger in the air stuff. 
>>> I don't recall if we invented the number and the compute people 
>>> agreed with it or if they proposed the number to us.
>>
>> Yeah me neither. And found nothing in my email archives. :(
>>
>> Thinking about it today I don't see that a disabled timeout is a 
>> practical default.
>>
>> With it, if users have something un-preemptable to run (assuming prio 
>> normal), it would get killed after ~13s (5 * 2.5).
>>
>> If we go for my scheme it gets killed in ~17.5s (3 * (2.5 + 2.5) + 2.5 
>> (third pulse triggers preempt timeout)).
>>
>> And if we go for your scheme it gets killed in ~22.5s (4 * 2.5 + 2 * 3 
>> * 2.5).
> Erm, that is not an apples to apples comparison. Your 17.5 is for an 
> engine reset tripped by the pre-emption timeout, but your 22.5s is for a 
> GT reset tripped by the heartbeat reaching the end and nuking the universe.

Right, in your scheme I did get it wrong. It would wait for the GuC to 
reset the engine at the end, rather than hit the fake "heartbeat 
stopped" full reset path in that case.

4 * 2.5 to trigger a max prio pulse, then 3 * 2.5 preempt timeout for 
the GuC to reset (last heartbeat delay extended so it does not 
trigger). So 17.5 as well.

> If you are saying that the first pulse at sufficient priority (third 
> being normal prio) is what causes the reset because the system is 
> working as expected and the pre-emption timeout trips the reset. In that 
> case, you have two periods to get to normal prio plus one pre-emption 
> timeout to trip the reset. I.e. (tH * 2) + tP.
> 
> Your scheme is then tH(actual) = tH(user) + tP, yes?
> So pre-emption based reset is after ((tH(user) + tP) * 2) + tP => (3 * 
> tP) + (2 * tH)
> And GT based reset is after (tH(user) + tP) * 5 => (5 * tP) + (5 * tH)
> 
> My scheme is tH(actual) = tH(user) for first four, then max(tH(user), 
> tP) for fifth.
> So pre-emption based reset is after (tH(user) * 2) + tP => tP + (2 * tH);
> And GT based reset is after (tH(user) * 4) + (max(tH(user), tP) * 1) => 
> greater of ((4 * tH) + tP) or (5 * tH)
> 
> Either way your scheme is longer. With tH(user) = 2.5s, tP(RCS) = 7.5s, 
> we get 27.5s for engine and 50s for GT versus my 12.5s for engine and 
> 17.5s for GT. With tP(RCS) = 2.5s, yours is 12.5s for engine and 25s for 
> GT versus my 7.5s for engine and 12.5s for GT.
> 
> Plus, not sure why your calculations above are using 2.5 for tP? Are you 
> still arguing that 7.5s is too long? That is a separate issue and not 
> related to the heartbeat algorithms. tP must be long enough to allow 
> 'out of box OpenCL workloads to complete'. That doesn't just mean not 
> being killed by the heartbeat, it also means not being killed by running 
> two of them concurrently (or one plus desktop OpenGL rendering) and not 
> having it killed by basic time slicing between the two contexts. The 
> heartbeat is not involved in that process. That is purely the 
> pre-emption timeout. And that is the fundamental reason why tP needs to 
> be much larger on RCS/CCS.
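
For the record, plugging the numbers into your formulas above does 
reproduce your figures (a quick sketch; tH and tP values are the ones 
from this thread):

```python
tH = 2.5  # tH(user), heartbeat period in seconds

for tP in (7.5, 2.5):  # pre-emption timeout: proposed RCS value, then 2.5s
    # My scheme: tH(actual) = tH(user) + tP for every pulse.
    tvrtko_engine = 3 * tP + 2 * tH      # pre-emption based engine reset
    tvrtko_gt = 5 * tP + 5 * tH          # heartbeat end-stop GT reset
    # Your scheme: plain tH for the first four pulses, max(tH, tP) for
    # the fifth.
    john_engine = tP + 2 * tH
    john_gt = max(4 * tH + tP, 5 * tH)
    print(tP, tvrtko_engine, tvrtko_gt, john_engine, john_gt)
    # tP=7.5: 27.5 / 50.0 versus 12.5 / 17.5
    # tP=2.5: 12.5 / 25.0 versus  7.5 / 12.5
```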

I was assuming a 2.5s tP is enough and basing all calculations on that, 
heartbeat or timeslicing regardless. I thought we established that 
neither of us knows how long is long enough.

Are you now saying 2.5s is definitely not enough? How is that usable for 
a default out of the box desktop?

>> If I did not confuse any calculation this time round, then the 
>> differences for default case for normal priority sound pretty 
>> immaterial to me.
>>
>>>> What if we gave them a default of 2.5s? That would be 4 * (2.5s + 
>>>> 2.5s) = 20s worst case until reset, comparable to your proposal. Are 
>>>> there realistic workloads which are non-preemptable for 2.5s? I am 
>>>> having hard time imagining someone would run them on a general 
>>>> purpose desktop since it would mean frozen and unusable UI anyway.
>>>>
>>>> Advantage still being in my mind that there would be no fudging of 
>>>> timeouts during driver load and heartbeat periods depending on 
>>>> priority. To me it feels more plausible to account for preempt 
>>>> timeout in heartbeat pulses than to calculate the preempt timeout to 
>>>> be longer than heartbeat pulses, just to avoid races between the two.
>>> Except that when the user asks for a heartbeat period of 2.5s you are 
>>> actually setting it to 5s. How is that not a major fudge that is 
>>> totally disregarding the user's request?
>>
>> This is indeed the core question. My thinking:
>>
>> It is not defined in the heartbeat ABI that N pulses should do 
>> anything, just that they are periodic pulses which check the health of 
>> an engine.
>>
>> If we view user priority as not under our control then we can say that 
>> any heartbeat pulse can trigger preempt timeout and we should let it 
>> do that.
>>
>> From that it follows that it is justified to account for preempt 
>> timeout in the decision when to schedule heartbeat pulses and that it 
>> is legitimate to do it for all of them.
> But it can be optimised to say that it doesn't matter if you bump the 
> priority of a pulse before waiting for the pre-emption period to expire. 
> If the pulse was already high enough prio then the pre-emption has 
> already been triggered and bumping the prio has no effect. If it was too 
> low then waiting for longer has no benefit at all.
> 
> All that matters is that you don't hit the end stop and trigger the GT 
> reset too early.

Yes, I agree it can also be argued the way you are saying.

I was trying to weigh the pros and cons of both approaches by bringing 
into the discussion the question of what heartbeats are. Given they are 
loosely/vaguely defined, I think we have freedom to tweak things.

And I don't have a problem with extending the last pulse. It is 
fundamentally correct to do regardless of the backend. I just raised 
the question of whether to extend all heartbeats to account for 
preemption (and scheduling delays). (What is the point of bumping their 
priority and re-scheduling if we didn't give the engine enough time to 
react? So the opposite of the question you raise.)

What I do have a problem with is deriving the preempt timeout from the 
heartbeat period. Hence I am looking at whether we can instead find a 
fixed number which works, and so avoid the bi-directional coupling.
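
To illustrate the direction of coupling I mean (a hypothetical sketch, 
not the i915 code; the function name is invented):

```python
# One-directional coupling: the heartbeat schedule reads the engine's
# preempt timeout, but the preempt timeout is never derived from the
# heartbeat period. Names are invented for illustration only.

def next_pulse_delay(heartbeat_s: float, preempt_timeout_s: float) -> float:
    # Account for the preempt timeout (and, implicitly, scheduling
    # latency) in every pulse, so a pulse is only re-evaluated after
    # the engine has had time to react to the priority bump.
    return heartbeat_s + preempt_timeout_s

print(next_pulse_delay(2.5, 2.5))  # 5.0 with the values assumed here
```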

>> It also avoids the double reset problem, regardless of the backend and 
>> regardless of how the user configured the timeouts. Without the need 
>> to fudge them either during driver load or during sysfs store.
>>
>> User has configured that heartbeat pulses should be sent every N 
>> seconds, yes, but I think we are free to account for inherent hardware 
>> and software latencies in our calculations. Especially since other 
>> than flawed Gen12 RCS, other engines will be much closer to the 
>> configured period.
>>
>> It is just the same as user asking for preempt timeout N and we say on 
>> driver load "oh no you won't get it". Same for heartbeats, they said 
>> 2.5s, we said 2.5s + broken engine factor...
> Why would you not get the pre-emption timeout? Because it is too large? 
> That is a physical limitation (of the GuC firmware) not an override 
> because we think we know better. If we can obey the user then we should 
> do so.

I was simply referring to the override in intel_engine_setup.

Regards,

Tvrtko
