All of lore.kernel.org
 help / color / mirror / Atom feed
From: John.C.Harrison@Intel.com
To: Intel-GFX@Lists.FreeDesktop.Org
Cc: Michal Mrozek <michal.mrozek@intel.com>,
	Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>,
	John Harrison <John.C.Harrison@Intel.com>,
	DRI-Devel@Lists.FreeDesktop.Org
Subject: [PATCH 3/3] drm/i915: Improve long running OCL w/a for GuC submission
Date: Fri, 18 Feb 2022 13:33:07 -0800	[thread overview]
Message-ID: <20220218213307.1338478-4-John.C.Harrison@Intel.com> (raw)
In-Reply-To: <20220218213307.1338478-1-John.C.Harrison@Intel.com>

From: John Harrison <John.C.Harrison@Intel.com>

A workaround was added to the driver to allow OpenCL workloads to run
'forever' by disabling pre-emption on the RCS engine for Gen12.
It is not totally unbound as the heartbeat will kick in eventually
and cause a reset of the hung engine.

However, this does not work well in GuC submission mode. In GuC mode,
the pre-emption timeout is how GuC detects hung contexts and triggers
a per engine reset. Thus, disabling the timeout means also losing all
per engine reset ability. A full GT reset will still occur when the
heartbeat finally expires, but that is a much more destructive and
undesirable mechanism.

The purpose of the workaround is actually to give OpenCL tasks longer
to reach a pre-emption point after a pre-emption request has been
issued. This is necessary because Gen12 does not support mid-thread
pre-emption and OpenCL can have long running threads.

So, rather than disabling the timeout completely, just set it to a
'long' value.

CC: Michal Mrozek <michal.mrozek@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2a1e9f36e6f5..64249301a227 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -385,9 +385,25 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
 	engine->props.timeslice_duration_ms =
 		CONFIG_DRM_I915_TIMESLICE_DURATION;
 
-	/* Override to uninterruptible for OpenCL workloads. */
-	if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS)
-		engine->props.preempt_timeout_ms = 0;
+	/*
+	 * Mid-thread pre-emption is not available in Gen12. Unfortunately,
+	 * some OpenCL workloads run quite long threads. That means they get
+	 * reset due to not pre-empting in a timely manner. So, bump the
+	 * pre-emption timeout value to be much higher for compute engines.
+	 * Using three times the heartbeat period seems long enough for a
+	 * reasonable task to reach a pre-emption point but not so long as to
+	 * allow genuine hangs to go unresolved.
+	 */
+	if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS) {
+		unsigned long triple_beat = engine->props.heartbeat_interval_ms * 3;
+
+		if (triple_beat > engine->props.preempt_timeout_ms) {
+			drm_info(&gt->i915->drm, "Bumping pre-emption timeout from %ld to %ld on %s to allow slow compute pre-emption\n",
+				 engine->props.preempt_timeout_ms, triple_beat, engine->name);
+
+			engine->props.preempt_timeout_ms = triple_beat;
+		}
+	}
 
 	/* Cap timeouts to prevent overflow inside GuC */
 	if (intel_guc_submission_is_wanted(&gt->uc.guc)) {
-- 
2.25.1


WARNING: multiple messages have this Message-ID (diff)
From: John.C.Harrison@Intel.com
To: Intel-GFX@Lists.FreeDesktop.Org
Cc: Michal Mrozek <michal.mrozek@intel.com>, DRI-Devel@Lists.FreeDesktop.Org
Subject: [Intel-gfx] [PATCH 3/3] drm/i915: Improve long running OCL w/a for GuC submission
Date: Fri, 18 Feb 2022 13:33:07 -0800	[thread overview]
Message-ID: <20220218213307.1338478-4-John.C.Harrison@Intel.com> (raw)
In-Reply-To: <20220218213307.1338478-1-John.C.Harrison@Intel.com>

From: John Harrison <John.C.Harrison@Intel.com>

A workaround was added to the driver to allow OpenCL workloads to run
'forever' by disabling pre-emption on the RCS engine for Gen12.
It is not totally unbound as the heartbeat will kick in eventually
and cause a reset of the hung engine.

However, this does not work well in GuC submission mode. In GuC mode,
the pre-emption timeout is how GuC detects hung contexts and triggers
a per engine reset. Thus, disabling the timeout means also losing all
per engine reset ability. A full GT reset will still occur when the
heartbeat finally expires, but that is a much more destructive and
undesirable mechanism.

The purpose of the workaround is actually to give OpenCL tasks longer
to reach a pre-emption point after a pre-emption request has been
issued. This is necessary because Gen12 does not support mid-thread
pre-emption and OpenCL can have long running threads.

So, rather than disabling the timeout completely, just set it to a
'long' value.

CC: Michal Mrozek <michal.mrozek@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2a1e9f36e6f5..64249301a227 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -385,9 +385,25 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
 	engine->props.timeslice_duration_ms =
 		CONFIG_DRM_I915_TIMESLICE_DURATION;
 
-	/* Override to uninterruptible for OpenCL workloads. */
-	if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS)
-		engine->props.preempt_timeout_ms = 0;
+	/*
+	 * Mid-thread pre-emption is not available in Gen12. Unfortunately,
+	 * some OpenCL workloads run quite long threads. That means they get
+	 * reset due to not pre-empting in a timely manner. So, bump the
+	 * pre-emption timeout value to be much higher for compute engines.
+	 * Using three times the heartbeat period seems long enough for a
+	 * reasonable task to reach a pre-emption point but not so long as to
+	 * allow genuine hangs to go unresolved.
+	 */
+	if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS) {
+		unsigned long triple_beat = engine->props.heartbeat_interval_ms * 3;
+
+		if (triple_beat > engine->props.preempt_timeout_ms) {
+			drm_info(&gt->i915->drm, "Bumping pre-emption timeout from %ld to %ld on %s to allow slow compute pre-emption\n",
+				 engine->props.preempt_timeout_ms, triple_beat, engine->name);
+
+			engine->props.preempt_timeout_ms = triple_beat;
+		}
+	}
 
 	/* Cap timeouts to prevent overflow inside GuC */
 	if (intel_guc_submission_is_wanted(&gt->uc.guc)) {
-- 
2.25.1


  parent reply	other threads:[~2022-02-18 21:33 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-18 21:33 [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads John.C.Harrison
2022-02-18 21:33 ` [Intel-gfx] " John.C.Harrison
2022-02-18 21:33 ` [PATCH 1/3] drm/i915/guc: Limit scheduling properties to avoid overflow John.C.Harrison
2022-02-18 21:33   ` [Intel-gfx] " John.C.Harrison
2022-02-22  9:52   ` Tvrtko Ursulin
2022-02-22 10:39     ` Tvrtko Ursulin
2022-02-23  2:11     ` John Harrison
2022-02-23 12:13       ` Tvrtko Ursulin
2022-02-23 19:03         ` John Harrison
2022-02-24  9:59           ` Tvrtko Ursulin
2022-02-24 19:19             ` John Harrison
2022-02-24 19:51               ` John Harrison
2022-02-25 17:44                 ` Tvrtko Ursulin
2022-02-25 17:06               ` Tvrtko Ursulin
2022-02-25 17:39                 ` John Harrison
2022-02-28 16:11                   ` Tvrtko Ursulin
2022-02-28 18:32                     ` John Harrison
2022-03-01 10:50                       ` Tvrtko Ursulin
2022-03-01 19:57                         ` John Harrison
2022-03-02  9:20                           ` Tvrtko Ursulin
2022-03-02 18:07                             ` John Harrison
2022-02-23  0:52   ` Ceraolo Spurio, Daniele
2022-02-23  2:15     ` John Harrison
2022-02-18 21:33 ` [PATCH 2/3] drm/i915/gt: Make the heartbeat play nice with long pre-emption timeouts John.C.Harrison
2022-02-18 21:33   ` [Intel-gfx] " John.C.Harrison
2022-02-22 11:19   ` Tvrtko Ursulin
2022-02-23  2:45     ` John Harrison
2022-02-23 13:58       ` Tvrtko Ursulin
2022-02-23 20:00         ` John Harrison
2022-02-24 11:41           ` Tvrtko Ursulin
2022-02-24 19:45             ` John Harrison
2022-02-25 18:14               ` Tvrtko Ursulin
2022-02-25 18:48                 ` John Harrison
2022-02-28 17:12                   ` Tvrtko Ursulin
2022-02-28 18:55                     ` John Harrison
2022-03-01 12:09                       ` Tvrtko Ursulin
2022-03-01 20:59                         ` John Harrison
2022-03-02 11:07                           ` Tvrtko Ursulin
2022-03-02 17:55                             ` John Harrison
2022-03-03  9:55                               ` Tvrtko Ursulin
2022-03-03 19:09                                 ` John Harrison
2022-03-04 12:36                                   ` Tvrtko Ursulin
2022-02-18 21:33 ` John.C.Harrison [this message]
2022-02-18 21:33   ` [Intel-gfx] [PATCH 3/3] drm/i915: Improve long running OCL w/a for GuC submission John.C.Harrison
2022-02-19  2:54 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Improve anti-pre-emption w/a for compute workloads Patchwork
2022-02-19  3:33 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2022-02-22  9:53 ` [Intel-gfx] [PATCH 0/3] " Tvrtko Ursulin
2022-02-23  2:22   ` John Harrison
2022-02-23 12:00     ` Tvrtko Ursulin
2022-02-24 20:02       ` John Harrison
2022-02-25 16:36         ` Tvrtko Ursulin
2022-02-25 17:11           ` John Harrison
2022-02-25 17:39             ` Tvrtko Ursulin
2022-02-25 18:01               ` John Harrison
2022-02-25 18:29                 ` Tvrtko Ursulin
2022-02-25 19:03                   ` John Harrison
2022-02-28 15:32                     ` Tvrtko Ursulin
2022-02-28 19:17                       ` John Harrison
2022-03-02 11:21                         ` Tvrtko Ursulin
2022-03-02 17:40                           ` John Harrison

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220218213307.1338478-4-John.C.Harrison@Intel.com \
    --to=john.c.harrison@intel.com \
    --cc=DRI-Devel@Lists.FreeDesktop.Org \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=daniele.ceraolospurio@intel.com \
    --cc=michal.mrozek@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.