Re: [Intel-gfx] [PATCH 2/3] drm/i915/gt: Make the heartbeat play nice with long pre-emption timeouts

From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: John Harrison <john.c.harrison@intel.com>,
	Intel-GFX@Lists.FreeDesktop.Org
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH 2/3] drm/i915/gt: Make the heartbeat play nice with long pre-emption timeouts
Date: Thu, 3 Mar 2022 09:55:31 +0000	[thread overview]
Message-ID: <f9daab55-6bda-b359-352d-8e585bce899c@linux.intel.com> (raw)
In-Reply-To: <389c16df-f579-81df-8405-376fcf8ce613@intel.com>

On 02/03/2022 17:55, John Harrison wrote:

>> I was assuming 2.5s tP is enough and basing all calculation on that. 
>> Heartbeat or timeslicing regardless. I thought we established neither 
>> of us knows how long is enough.
>>
>> Are you now saying 2.5s is definitely not enough? How is that usable 
>> for a default out of the box desktop?
> Show me your proof that 2.5s is enough.
> 
> 7.5s is what we have been using internally for a very long time. It has 
> approval from all relevant parties. If you wish to pick a new random 
> number then please provide data to back it up along with buy in from all 
> UMD teams and project management.

And upstream disabled preemption has acks from compute. "Internally" is 
as far away from out of the box desktop experiences we have been arguing 
about. In fact you have argued compute disables the hearbeat anyway.

Lets jump to the end of this reply please for actions.

>> And I don't have a problem with extending the last pulse. It is 
>> fundamentally correct to do regardless of the backend. I just raised 
>> the question of whether to extend all heartbeats to account for 
>> preemption (and scheduling delays). (What is the point of bumping 
>> their priority and re-scheduling if we didn't give enough time to the 
>> engine to react? So opposite of the question you raise.)
> The point is that it we are giving enough time to react. Raising the 
> priority of a pre-emption that has already been triggered will have no 
> effect. So as long as the total time from when the pre-emption is 
> triggered (prio becomes sufficiently high) to the point when the reset 
> is decided is longer than the pre-emption timeout then it works. Given 
> that, it is unnecessary to increase the intermediate periods. It has no 
> advantage and has the disadvantage of making the total time unreasonably 
> long.
> 
> So again, what is the point of making every period longer? What benefit 
> does it *actually* give?

Less special casing and pointless prio bumps ahead of giving time to 
engine to even react. You wouldn't have to have the last pulse 2 * tP 
but normal tH + tP. So again, it is nicer for me to derive all heartbeat 
pulses from the same input parameters.

The whole "it is very long" argument is IMO moot because now proposed 
7.5s preempt period is I suspect wholly impractical for desktop. 
Combined with the argument that real compute disables heartbeats anyway 
even extra so.

> Fine. "tP(RCS) = 7500;" can I merge the patch now?
I could live with setting preempt timeout to 7.5s. The downside is 
slower time to restoring frozen desktops. Worst case today 5 * 2.5s, 
with changes 4 * 2.5s + 2 * 7.5s; so from 12.5s to 25s, doubling..

Actions:

1)
Get a number from compute/OpenCL people for what they say is minimum 
preempt timeout for default out of the box Linux desktop experience.

This does not mean them running some tests and can't be bothered to 
setup up the machine for the extreme use cases, but workloads average 
users can realistically be expected to run.

Say for instance some image manipulation software which is OpenCL 
accelerated or similar. How long unpreemptable sections are expected 
there. Or similar. I am not familiar what all OpenCL accelerated use 
cases there are on Linux.

And this number should be purely about minimum preempt timeout, not 
considering heartbeats. This is because preempt timeout may kick in 
sooner than stopped heartbeat if the user workload is low priority.

2)
Commit message should explain the effect on the worst case time until 
engine reset.

3)
OpenCL/compute should ack the change publicly as well since they acked 
the disabling of preemption.

4)
I really want overflows_type in the first patch.

My position is that with the above satisfied it is okay to merge.

Regards,

Tvrtko