All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: John Harrison <john.c.harrison@intel.com>,
	Intel-GFX@Lists.FreeDesktop.Org
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads
Date: Fri, 25 Feb 2022 17:39:58 +0000	[thread overview]
Message-ID: <720fe7f8-a712-c755-9fbe-70ac2fa008f6@linux.intel.com> (raw)
In-Reply-To: <bb8d25fa-4b6f-0c53-302a-623787760bcd@intel.com>


On 25/02/2022 17:11, John Harrison wrote:
> On 2/25/2022 08:36, Tvrtko Ursulin wrote:
>> On 24/02/2022 20:02, John Harrison wrote:
>>> On 2/23/2022 04:00, Tvrtko Ursulin wrote:
>>>> On 23/02/2022 02:22, John Harrison wrote:
>>>>> On 2/22/2022 01:53, Tvrtko Ursulin wrote:
>>>>>> On 18/02/2022 21:33, John.C.Harrison@Intel.com wrote:
>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>
>>>>>>> Compute workloads are inherently not pre-emptible on current 
>>>>>>> hardware.
>>>>>>> Thus the pre-emption timeout was disabled as a workaround to prevent
>>>>>>> unwanted resets. Instead, the hang detection was left to the 
>>>>>>> heartbeat
>>>>>>> and its (longer) timeout. This is undesirable with GuC submission as
>>>>>>> the heartbeat is a full GT reset rather than a per engine reset 
>>>>>>> and so
>>>>>>> is much more destructive. Instead, just bump the pre-emption timeout
>>>>>>
>>>>>> Can we have a feature request to allow asking GuC for an engine 
>>>>>> reset?
>>>>> For what purpose?
>>>>
>>>> To allow "stopped heartbeat" to reset the engine, however..
>>>>
>>>>> GuC manages the scheduling of contexts across engines. With virtual 
>>>>> engines, the KMD has no knowledge of which engine a context might 
>>>>> be executing on. Even without virtual engines, the KMD still has no 
>>>>> knowledge of which context is currently executing on any given 
>>>>> engine at any given time.
>>>>>
>>>>> There is a reason why hang detection should be left to the entity 
>>>>> that is doing the scheduling. Any other entity is second guessing 
>>>>> at best.
>>>>>
>>>>> The reason for keeping the heartbeat around even when GuC 
>>>>> submission is enabled is for the case where the KMD/GuC have got 
>>>>> out of sync with either other somehow or GuC itself has just 
>>>>> crashed. I.e. when no submission at all is working and we need to 
>>>>> reset the GuC itself and start over.
>>>>
>>>> .. I wasn't really up to speed to know/remember heartbeats are 
>>>> nerfed already in GuC mode.
>>> Not sure what you mean by that claim. Engine resets are handled by 
>>> GuC because GuC handles the scheduling. You can't do the former if 
>>> you aren't doing the latter. However, the heartbeat is still present 
>>> and is still the watchdog by which engine resets are triggered. As 
>>> per the rest of the submission process, the hang detection and 
>>> recovery is split between i915 and GuC.
>>
>> I meant that "stopped heartbeat on engine XXX" can only do a full GPU 
>> reset on GuC.
> I mean that there is no 'stopped heartbeat on engine XXX' when i915 is 
> not handling the recovery part of the process.

Hmmmm?

static void
reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
{
	if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
		show_heartbeat(rq, engine);

	if (intel_engine_uses_guc(engine))
		/*
		 * GuC itself is toast or GuC's hang detection
		 * is disabled. Either way, need to find the
		 * hang culprit manually.
		 */
		intel_guc_find_hung_context(engine);

	intel_gt_handle_error(engine->gt, engine->mask,
			      I915_ERROR_CAPTURE,
			      "stopped heartbeat on %s",
			      engine->name);
}

How there is no "stopped hearbeat" in guc mode? From this code it certainly looks there is.

You say below heartbeats are going in GuC mode. Now I totally don't understand how they are going but there is allegedly no "stopped hearbeat".

>>
>>     intel_gt_handle_error(engine->gt, engine->mask,
>>                   I915_ERROR_CAPTURE,
>>                   "stopped heartbeat on %s",
>>                   engine->name);
>>
>> intel_gt_handle_error:
>>
>>     /*
>>      * Try engine reset when available. We fall back to full reset if
>>      * single reset fails.
>>      */
>>     if (!intel_uc_uses_guc_submission(&gt->uc) &&
>>         intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
>>         local_bh_disable();
>>         for_each_engine_masked(engine, gt, engine_mask, tmp) {
>>
>> You said "However, the heartbeat is still present and is still the 
>> watchdog by which engine resets are triggered", now I don't know what 
>> you meant by this. It actually triggers a single engine reset in GuC 
>> mode? Where in code does that happen if this block above shows it not 
>> taking the engine reset path?
> i915 sends down the per engine pulse.
> GuC schedules the pulse
> GuC attempts to pre-empt the currently active context
> GuC detects the pre-emption timeout
> GuC resets the engine
> 
> The fundamental process is exactly the same as in execlist mode. It's 
> just that the above blocks of code (calls to intel_gt_handle_error and 
> such) are now inside the GuC not i915.
> 
> Without the heartbeat going ping, there would be no context switching 
> and thus no pre-emption, no pre-emption timeout and so no hang and reset 
> recovery. And GuC cannot sent pulses by itself - it has no ability to 
> generate context workloads. So we need i915 to send the pings and to 
> gradually raise their priority. But the back half of the heartbeat code 
> is now inside the GuC. It will simply never reach the i915 side timeout 
> if GuC is doing the recovery (unless the heartbeat's final period is too 
> short). We should only reach the i915 side timeout if GuC itself is 
> toast. At which point we need the full GT reset to recover the GuC.

If workload is not preempting and reset does not work, like engine is truly stuck, does the current code hit "stopped heartbeat" or not in GuC mode?

Regards,

Tvrtko

  reply	other threads:[~2022-02-25 17:40 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-18 21:33 [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads John.C.Harrison
2022-02-18 21:33 ` [Intel-gfx] " John.C.Harrison
2022-02-18 21:33 ` [PATCH 1/3] drm/i915/guc: Limit scheduling properties to avoid overflow John.C.Harrison
2022-02-18 21:33   ` [Intel-gfx] " John.C.Harrison
2022-02-22  9:52   ` Tvrtko Ursulin
2022-02-22 10:39     ` Tvrtko Ursulin
2022-02-23  2:11     ` John Harrison
2022-02-23 12:13       ` Tvrtko Ursulin
2022-02-23 19:03         ` John Harrison
2022-02-24  9:59           ` Tvrtko Ursulin
2022-02-24 19:19             ` John Harrison
2022-02-24 19:51               ` John Harrison
2022-02-25 17:44                 ` Tvrtko Ursulin
2022-02-25 17:06               ` Tvrtko Ursulin
2022-02-25 17:39                 ` John Harrison
2022-02-28 16:11                   ` Tvrtko Ursulin
2022-02-28 18:32                     ` John Harrison
2022-03-01 10:50                       ` Tvrtko Ursulin
2022-03-01 19:57                         ` John Harrison
2022-03-02  9:20                           ` Tvrtko Ursulin
2022-03-02 18:07                             ` John Harrison
2022-02-23  0:52   ` Ceraolo Spurio, Daniele
2022-02-23  2:15     ` John Harrison
2022-02-18 21:33 ` [PATCH 2/3] drm/i915/gt: Make the heartbeat play nice with long pre-emption timeouts John.C.Harrison
2022-02-18 21:33   ` [Intel-gfx] " John.C.Harrison
2022-02-22 11:19   ` Tvrtko Ursulin
2022-02-23  2:45     ` John Harrison
2022-02-23 13:58       ` Tvrtko Ursulin
2022-02-23 20:00         ` John Harrison
2022-02-24 11:41           ` Tvrtko Ursulin
2022-02-24 19:45             ` John Harrison
2022-02-25 18:14               ` Tvrtko Ursulin
2022-02-25 18:48                 ` John Harrison
2022-02-28 17:12                   ` Tvrtko Ursulin
2022-02-28 18:55                     ` John Harrison
2022-03-01 12:09                       ` Tvrtko Ursulin
2022-03-01 20:59                         ` John Harrison
2022-03-02 11:07                           ` Tvrtko Ursulin
2022-03-02 17:55                             ` John Harrison
2022-03-03  9:55                               ` Tvrtko Ursulin
2022-03-03 19:09                                 ` John Harrison
2022-03-04 12:36                                   ` Tvrtko Ursulin
2022-02-18 21:33 ` [PATCH 3/3] drm/i915: Improve long running OCL w/a for GuC submission John.C.Harrison
2022-02-18 21:33   ` [Intel-gfx] " John.C.Harrison
2022-02-19  2:54 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Improve anti-pre-emption w/a for compute workloads Patchwork
2022-02-19  3:33 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2022-02-22  9:53 ` [Intel-gfx] [PATCH 0/3] " Tvrtko Ursulin
2022-02-23  2:22   ` John Harrison
2022-02-23 12:00     ` Tvrtko Ursulin
2022-02-24 20:02       ` John Harrison
2022-02-25 16:36         ` Tvrtko Ursulin
2022-02-25 17:11           ` John Harrison
2022-02-25 17:39             ` Tvrtko Ursulin [this message]
2022-02-25 18:01               ` John Harrison
2022-02-25 18:29                 ` Tvrtko Ursulin
2022-02-25 19:03                   ` John Harrison
2022-02-28 15:32                     ` Tvrtko Ursulin
2022-02-28 19:17                       ` John Harrison
2022-03-02 11:21                         ` Tvrtko Ursulin
2022-03-02 17:40                           ` John Harrison

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=720fe7f8-a712-c755-9fbe-70ac2fa008f6@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=DRI-Devel@Lists.FreeDesktop.Org \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=john.c.harrison@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.