All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: John Harrison <john.c.harrison@intel.com>,
	Intel-GFX@Lists.FreeDesktop.Org, "Bloomfield,
	Jon" <jon.bloomfield@intel.com>,
	"Ewins, Jon" <jon.ewins@intel.com>
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH 2/3] drm/i915/gt: Make the heartbeat play nice with long pre-emption timeouts
Date: Fri, 4 Mar 2022 12:36:01 +0000	[thread overview]
Message-ID: <2bb1563f-83b6-495e-db8f-63b870744bf3@linux.intel.com> (raw)
In-Reply-To: <711fbb74-e6fd-c78b-5c01-9cb1d1b6d624@intel.com>


On 03/03/2022 19:09, John Harrison wrote:

>> Actions:
>>
>> 1)
>> Get a number from compute/OpenCL people for what they say is minimum 
>> preempt timeout for default out of the box Linux desktop experience.
> That would be the one that has been agreed upon by both linux software 
> arch and all UMD teams and has been in use for the past year or more in 
> the internal tree.

What has been used in the internal tree is irrelevant when UMD ack is needed for changes which affect upstream shipping platforms like Tigerlake.

>> This does not mean them running some tests and can't be bothered to 
>> setup up the machine for the extreme use cases, but workloads average 
>> users can realistically be expected to run.
>>
>> Say for instance some image manipulation software which is OpenCL 
>> accelerated or similar. How long unpreemptable sections are expected 
>> there. Or similar. I am not familiar what all OpenCL accelerated use 
>> cases there are on Linux.
>>
>> And this number should be purely about minimum preempt timeout, not 
>> considering heartbeats. This is because preempt timeout may kick in 
>> sooner than stopped heartbeat if the user workload is low priority.
>>
> And driver is simply hosed in the intervening six months or more that it 
> takes for the right people to find the time to do this.

What is hosed? Driver currently contains a patch which was acked by the compute UMD to disable preemption. If it takes six months for compute UMD to give us a different number which works for the open source stack and typical use cases then what can we do.

> Right now, it is broken. This patch set improves things. Actual numbers 
> can be refined later as/when some random use case that we hadn't 
> previously thought of pops up. But not fixing the basic problem at all 
> until we have an absolutely perfect for all parties solution is 
> pointless. Not least because there is no perfect solution. No matter 
> what number you pick it is going to be wrong for someone.
> 
> 2.5s, 7.5s, X.Ys, I really don't care. 2.5s is a number you seem to have 
> picked out of the air totally at random, or maybe based on it being the 
> heartbeat period (except that you keep arguing that basing tP on tH is 
> wrong). 7.5s is a number that has been in active use for a lot of 
> testing for quite some time - KMD CI, UMD CI, E2E, etc. But either way, 
> the initial number is almost irrelevant as long as it is not zero. So 
> can we please just get something merged now as a starting point?
> 
> 
>> 2)
>> Commit message should explain the effect on the worst case time until 
>> engine reset.
>>
>> 3)
>> OpenCL/compute should ack the change publicly as well since they acked 
>> the disabling of preemption.
> This patch set has already been publicly acked by the compute team. See 
> the 'acked-by' tag.

I can't find the reply which contained the ack on the mailing list - do you have a msg-id or an archive link?

Also, ack needs to be against the the fixed timeout patch and not one dependent on the heartbeat interval.

>> 4)
>> I really want overflows_type in the first patch.
> In the final GuC assignment? Only if it is a BUG_ON. If we get a failure 
> there it is an internal driver error and cannot be corrected for. It is 
> too late for any plausible range check action.

If you can find a test which exercises setting insane values to the relevant timeouts and so would hit the problem in our CI then BUG_ON is fine. Otherwise I think BUG_ON is too anti-social and prefer drm_warn or drm_WARN_ON. I don't think adding a test is strictly necessary, if we don't already have one, given how unlikely this is too be hit, but if you insist on a BUG_ON instead of a flavour of a warn then I think we need one so we can catch in CI 100% of the time.
  
> And if you mean in the the actual helper function with the rest of the 
> clamping then you are bleeding internal GuC API structure details into 
> non-GuC code. Plus the test would be right next to the 'if (size < 

In my other reply I exactly described that would be a downside and that I prefer checks at the assignment sites.

Also regarding this comment in the relevant patch:

+	/*
+	 * NB: The GuC API only supports 32bit values. However, the limit is further
+	 * reduced due to internal calculations which would otherwise overflow.
+	 */

I would suggest clarifying this as "The GuC API only supports timeouts up to U32_MAX micro-seconds. However, ...". Given the function at hand deals in milliseconds explicitly calling out that additional scaling factor makes sense I think.

Big picture - it's really still very simple. Public ack for a fixed number and a warn on is not really a lot to ask.

Regards,

Tvrtko

  reply	other threads:[~2022-03-04 12:36 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-18 21:33 [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads John.C.Harrison
2022-02-18 21:33 ` [Intel-gfx] " John.C.Harrison
2022-02-18 21:33 ` [PATCH 1/3] drm/i915/guc: Limit scheduling properties to avoid overflow John.C.Harrison
2022-02-18 21:33   ` [Intel-gfx] " John.C.Harrison
2022-02-22  9:52   ` Tvrtko Ursulin
2022-02-22 10:39     ` Tvrtko Ursulin
2022-02-23  2:11     ` John Harrison
2022-02-23 12:13       ` Tvrtko Ursulin
2022-02-23 19:03         ` John Harrison
2022-02-24  9:59           ` Tvrtko Ursulin
2022-02-24 19:19             ` John Harrison
2022-02-24 19:51               ` John Harrison
2022-02-25 17:44                 ` Tvrtko Ursulin
2022-02-25 17:06               ` Tvrtko Ursulin
2022-02-25 17:39                 ` John Harrison
2022-02-28 16:11                   ` Tvrtko Ursulin
2022-02-28 18:32                     ` John Harrison
2022-03-01 10:50                       ` Tvrtko Ursulin
2022-03-01 19:57                         ` John Harrison
2022-03-02  9:20                           ` Tvrtko Ursulin
2022-03-02 18:07                             ` John Harrison
2022-02-23  0:52   ` Ceraolo Spurio, Daniele
2022-02-23  2:15     ` John Harrison
2022-02-18 21:33 ` [PATCH 2/3] drm/i915/gt: Make the heartbeat play nice with long pre-emption timeouts John.C.Harrison
2022-02-18 21:33   ` [Intel-gfx] " John.C.Harrison
2022-02-22 11:19   ` Tvrtko Ursulin
2022-02-23  2:45     ` John Harrison
2022-02-23 13:58       ` Tvrtko Ursulin
2022-02-23 20:00         ` John Harrison
2022-02-24 11:41           ` Tvrtko Ursulin
2022-02-24 19:45             ` John Harrison
2022-02-25 18:14               ` Tvrtko Ursulin
2022-02-25 18:48                 ` John Harrison
2022-02-28 17:12                   ` Tvrtko Ursulin
2022-02-28 18:55                     ` John Harrison
2022-03-01 12:09                       ` Tvrtko Ursulin
2022-03-01 20:59                         ` John Harrison
2022-03-02 11:07                           ` Tvrtko Ursulin
2022-03-02 17:55                             ` John Harrison
2022-03-03  9:55                               ` Tvrtko Ursulin
2022-03-03 19:09                                 ` John Harrison
2022-03-04 12:36                                   ` Tvrtko Ursulin [this message]
2022-02-18 21:33 ` [PATCH 3/3] drm/i915: Improve long running OCL w/a for GuC submission John.C.Harrison
2022-02-18 21:33   ` [Intel-gfx] " John.C.Harrison
2022-02-19  2:54 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Improve anti-pre-emption w/a for compute workloads Patchwork
2022-02-19  3:33 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2022-02-22  9:53 ` [Intel-gfx] [PATCH 0/3] " Tvrtko Ursulin
2022-02-23  2:22   ` John Harrison
2022-02-23 12:00     ` Tvrtko Ursulin
2022-02-24 20:02       ` John Harrison
2022-02-25 16:36         ` Tvrtko Ursulin
2022-02-25 17:11           ` John Harrison
2022-02-25 17:39             ` Tvrtko Ursulin
2022-02-25 18:01               ` John Harrison
2022-02-25 18:29                 ` Tvrtko Ursulin
2022-02-25 19:03                   ` John Harrison
2022-02-28 15:32                     ` Tvrtko Ursulin
2022-02-28 19:17                       ` John Harrison
2022-03-02 11:21                         ` Tvrtko Ursulin
2022-03-02 17:40                           ` John Harrison

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2bb1563f-83b6-495e-db8f-63b870744bf3@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=DRI-Devel@Lists.FreeDesktop.Org \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=john.c.harrison@intel.com \
    --cc=jon.bloomfield@intel.com \
    --cc=jon.ewins@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.