All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Intel-gfx@lists.freedesktop.org
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>,
	Matthew Auld <matthew.auld@intel.com>,
	dri-devel@lists.freedesktop.org,
	Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Subject: [PATCH 6/7] drm/i915: Fail too long user submissions by default
Date: Wed, 24 Mar 2021 12:13:34 +0000	[thread overview]
Message-ID: <20210324121335.2307063-7-tvrtko.ursulin@linux.intel.com> (raw)
In-Reply-To: <20210324121335.2307063-1-tvrtko.ursulin@linux.intel.com>

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

A new Kconfig option CONFIG_DRM_I915_REQUEST_TIMEOUT is added, defaulting
to 20s, and this timeout is applied to all users contexts using the
previously added watchdog facility.

Result of this is that any user submission will simply fail after this
timeout, either causing a reset (for non-preemptable), or incomplete
results.

This can have an effect that workloads which used to work fine will
suddenly start failing. Even workloads comprised of short batches but in
long dependency chains can be terminated.

And because of lack of agreement on usefulness and safety of fence error
propagation this partial execution can be invisible to userspace even if
it is "listening" to returned fence status.

Another interaction is with hangcheck where care needs to be taken timeout
is not set lower or close to three times the heartbeat interval. Otherwise
a hang in any application can cause complete termination of all
submissions from unrelated clients. Any users modifying the per engine
heartbeat intervals therefore need to be aware of this potential denial of
service to avoid inadvertently enabling it.

Given all this I am personally not convinced the scheme is a good idea.
Intuitively it feels object importers would be better positioned to
enforce the time they are willing to wait for something to complete.

v2:
 * Improved commit message and Kconfig text.
 * Pull in some helper code from patch which got dropped.

v3:
 * Bump timeout to 20s to see if it helps Tigerlake.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/Kconfig.profile          | 14 +++++++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 38 +++++++++++++++++++
 .../gpu/drm/i915/gem/i915_gem_context_types.h |  4 ++
 drivers/gpu/drm/i915/gt/intel_context_param.h | 11 +++++-
 4 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index 35bbe2b80596..39328567c200 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -1,3 +1,17 @@
+config DRM_I915_REQUEST_TIMEOUT
+	int "Default timeout for requests (ms)"
+	default 20000 # milliseconds
+	help
+	  Configures the default timeout after which any user submissions will
+	  be forcefully terminated.
+
+	  Beware setting this value lower, or close to heartbeat interval
+	  rounded to whole seconds times three, in order to avoid allowing
+	  misbehaving applications causing total rendering failure in unrelated
+	  clients.
+
+	  May be 0 to disable the timeout.
+
 config DRM_I915_FENCE_TIMEOUT
 	int "Timeout for unsignaled foreign fences (ms, jiffy granularity)"
 	default 10000 # milliseconds
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 03a2f5f2a11f..33ff1a6a7724 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -233,6 +233,8 @@ static void intel_context_set_gem(struct intel_context *ce,
 	if (ctx->sched.priority >= I915_PRIORITY_NORMAL &&
 	    intel_engine_has_timeslices(ce->engine))
 		__set_bit(CONTEXT_USE_SEMAPHORES, &ce->flags);
+
+	intel_context_set_watchdog_us(ce, ctx->watchdog.timeout_us);
 }
 
 static void __free_engines(struct i915_gem_engines *e, unsigned int count)
@@ -820,6 +822,40 @@ static void __assign_timeline(struct i915_gem_context *ctx,
 	context_apply_all(ctx, __apply_timeline, timeline);
 }
 
+static int __apply_watchdog(struct intel_context *ce, void *timeout_us)
+{
+	return intel_context_set_watchdog_us(ce, (uintptr_t)timeout_us);
+}
+
+static int
+__set_watchdog(struct i915_gem_context *ctx, unsigned long timeout_us)
+{
+	int ret;
+
+	ret = context_apply_all(ctx, __apply_watchdog,
+				(void *)(uintptr_t)timeout_us);
+	if (!ret)
+		ctx->watchdog.timeout_us = timeout_us;
+
+	return ret;
+}
+
+static void __set_default_fence_expiry(struct i915_gem_context *ctx)
+{
+	struct drm_i915_private *i915 = ctx->i915;
+	int ret;
+
+	if (!IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT))
+		return;
+
+	/* Default expiry for user fences. */
+	ret = __set_watchdog(ctx, CONFIG_DRM_I915_REQUEST_TIMEOUT * 1000);
+	if (ret)
+		drm_notice(&i915->drm,
+			   "Failed to configure default fence expiry! (%d)",
+			   ret);
+}
+
 static struct i915_gem_context *
 i915_gem_create_context(struct drm_i915_private *i915, unsigned int flags)
 {
@@ -864,6 +900,8 @@ i915_gem_create_context(struct drm_i915_private *i915, unsigned int flags)
 		intel_timeline_put(timeline);
 	}
 
+	__set_default_fence_expiry(ctx);
+
 	trace_i915_context_create(ctx);
 
 	return ctx;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index d5bc75508048..f17da7e26c43 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -150,6 +150,10 @@ struct i915_gem_context {
 	 */
 	atomic_t active_count;
 
+	struct {
+		u64 timeout_us;
+	} watchdog;
+
 	/**
 	 * @hang_timestamp: The last time(s) this context caused a GPU hang
 	 */
diff --git a/drivers/gpu/drm/i915/gt/intel_context_param.h b/drivers/gpu/drm/i915/gt/intel_context_param.h
index f053d8633fe2..3ecacc675f41 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_param.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_param.h
@@ -6,9 +6,18 @@
 #ifndef INTEL_CONTEXT_PARAM_H
 #define INTEL_CONTEXT_PARAM_H
 
-struct intel_context;
+#include <linux/types.h>
+
+#include "intel_context.h"
 
 int intel_context_set_ring_size(struct intel_context *ce, long sz);
 long intel_context_get_ring_size(struct intel_context *ce);
 
+static inline int
+intel_context_set_watchdog_us(struct intel_context *ce, u64 timeout_us)
+{
+	ce->watchdog.timeout_us = timeout_us;
+	return 0;
+}
+
 #endif /* INTEL_CONTEXT_PARAM_H */
-- 
2.27.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

WARNING: multiple messages have this Message-ID (diff)
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Intel-gfx@lists.freedesktop.org
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>,
	Matthew Auld <matthew.auld@intel.com>,
	dri-devel@lists.freedesktop.org
Subject: [Intel-gfx] [PATCH 6/7] drm/i915: Fail too long user submissions by default
Date: Wed, 24 Mar 2021 12:13:34 +0000	[thread overview]
Message-ID: <20210324121335.2307063-7-tvrtko.ursulin@linux.intel.com> (raw)
In-Reply-To: <20210324121335.2307063-1-tvrtko.ursulin@linux.intel.com>

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

A new Kconfig option CONFIG_DRM_I915_REQUEST_TIMEOUT is added, defaulting
to 20s, and this timeout is applied to all users contexts using the
previously added watchdog facility.

Result of this is that any user submission will simply fail after this
timeout, either causing a reset (for non-preemptable), or incomplete
results.

This can have an effect that workloads which used to work fine will
suddenly start failing. Even workloads comprised of short batches but in
long dependency chains can be terminated.

And because of lack of agreement on usefulness and safety of fence error
propagation this partial execution can be invisible to userspace even if
it is "listening" to returned fence status.

Another interaction is with hangcheck where care needs to be taken timeout
is not set lower or close to three times the heartbeat interval. Otherwise
a hang in any application can cause complete termination of all
submissions from unrelated clients. Any users modifying the per engine
heartbeat intervals therefore need to be aware of this potential denial of
service to avoid inadvertently enabling it.

Given all this I am personally not convinced the scheme is a good idea.
Intuitively it feels object importers would be better positioned to
enforce the time they are willing to wait for something to complete.

v2:
 * Improved commit message and Kconfig text.
 * Pull in some helper code from patch which got dropped.

v3:
 * Bump timeout to 20s to see if it helps Tigerlake.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/Kconfig.profile          | 14 +++++++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 38 +++++++++++++++++++
 .../gpu/drm/i915/gem/i915_gem_context_types.h |  4 ++
 drivers/gpu/drm/i915/gt/intel_context_param.h | 11 +++++-
 4 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index 35bbe2b80596..39328567c200 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -1,3 +1,17 @@
+config DRM_I915_REQUEST_TIMEOUT
+	int "Default timeout for requests (ms)"
+	default 20000 # milliseconds
+	help
+	  Configures the default timeout after which any user submissions will
+	  be forcefully terminated.
+
+	  Beware setting this value lower, or close to heartbeat interval
+	  rounded to whole seconds times three, in order to avoid allowing
+	  misbehaving applications causing total rendering failure in unrelated
+	  clients.
+
+	  May be 0 to disable the timeout.
+
 config DRM_I915_FENCE_TIMEOUT
 	int "Timeout for unsignaled foreign fences (ms, jiffy granularity)"
 	default 10000 # milliseconds
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 03a2f5f2a11f..33ff1a6a7724 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -233,6 +233,8 @@ static void intel_context_set_gem(struct intel_context *ce,
 	if (ctx->sched.priority >= I915_PRIORITY_NORMAL &&
 	    intel_engine_has_timeslices(ce->engine))
 		__set_bit(CONTEXT_USE_SEMAPHORES, &ce->flags);
+
+	intel_context_set_watchdog_us(ce, ctx->watchdog.timeout_us);
 }
 
 static void __free_engines(struct i915_gem_engines *e, unsigned int count)
@@ -820,6 +822,40 @@ static void __assign_timeline(struct i915_gem_context *ctx,
 	context_apply_all(ctx, __apply_timeline, timeline);
 }
 
+static int __apply_watchdog(struct intel_context *ce, void *timeout_us)
+{
+	return intel_context_set_watchdog_us(ce, (uintptr_t)timeout_us);
+}
+
+static int
+__set_watchdog(struct i915_gem_context *ctx, unsigned long timeout_us)
+{
+	int ret;
+
+	ret = context_apply_all(ctx, __apply_watchdog,
+				(void *)(uintptr_t)timeout_us);
+	if (!ret)
+		ctx->watchdog.timeout_us = timeout_us;
+
+	return ret;
+}
+
+static void __set_default_fence_expiry(struct i915_gem_context *ctx)
+{
+	struct drm_i915_private *i915 = ctx->i915;
+	int ret;
+
+	if (!IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT))
+		return;
+
+	/* Default expiry for user fences. */
+	ret = __set_watchdog(ctx, CONFIG_DRM_I915_REQUEST_TIMEOUT * 1000);
+	if (ret)
+		drm_notice(&i915->drm,
+			   "Failed to configure default fence expiry! (%d)",
+			   ret);
+}
+
 static struct i915_gem_context *
 i915_gem_create_context(struct drm_i915_private *i915, unsigned int flags)
 {
@@ -864,6 +900,8 @@ i915_gem_create_context(struct drm_i915_private *i915, unsigned int flags)
 		intel_timeline_put(timeline);
 	}
 
+	__set_default_fence_expiry(ctx);
+
 	trace_i915_context_create(ctx);
 
 	return ctx;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index d5bc75508048..f17da7e26c43 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -150,6 +150,10 @@ struct i915_gem_context {
 	 */
 	atomic_t active_count;
 
+	struct {
+		u64 timeout_us;
+	} watchdog;
+
 	/**
 	 * @hang_timestamp: The last time(s) this context caused a GPU hang
 	 */
diff --git a/drivers/gpu/drm/i915/gt/intel_context_param.h b/drivers/gpu/drm/i915/gt/intel_context_param.h
index f053d8633fe2..3ecacc675f41 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_param.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_param.h
@@ -6,9 +6,18 @@
 #ifndef INTEL_CONTEXT_PARAM_H
 #define INTEL_CONTEXT_PARAM_H
 
-struct intel_context;
+#include <linux/types.h>
+
+#include "intel_context.h"
 
 int intel_context_set_ring_size(struct intel_context *ce, long sz);
 long intel_context_get_ring_size(struct intel_context *ce);
 
+static inline int
+intel_context_set_watchdog_us(struct intel_context *ce, u64 timeout_us)
+{
+	ce->watchdog.timeout_us = timeout_us;
+	return 0;
+}
+
 #endif /* INTEL_CONTEXT_PARAM_H */
-- 
2.27.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  parent reply	other threads:[~2021-03-24 12:14 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-24 12:13 [PATCH v4 0/7] Default request/fence expiry + watchdog Tvrtko Ursulin
2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
2021-03-24 12:13 ` [PATCH 1/7] drm/i915: Extract active lookup engine to a helper Tvrtko Ursulin
2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-03-24 12:21   ` Matthew Auld
2021-03-24 12:21     ` Matthew Auld
2021-03-24 12:13 ` [PATCH 2/7] drm/i915: Individual request cancellation Tvrtko Ursulin
2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-03-24 15:24   ` Matthew Auld
2021-03-24 15:24     ` Matthew Auld
2021-03-24 12:13 ` [PATCH 3/7] drm/i915: Restrict sentinel requests further Tvrtko Ursulin
2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-03-24 15:25   ` Matthew Auld
2021-03-24 15:25     ` [Intel-gfx] " Matthew Auld
2021-03-26  0:01     ` Daniel Vetter
2021-03-26  0:01       ` Daniel Vetter
2021-03-24 12:13 ` [PATCH 4/7] drm/i915: Handle async cancellation in sentinel assert Tvrtko Ursulin
2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-03-24 17:22   ` Matthew Auld
2021-03-24 17:22     ` Matthew Auld
2021-03-24 12:13 ` [PATCH 5/7] drm/i915: Request watchdog infrastructure Tvrtko Ursulin
2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-03-26  0:00   ` Daniel Vetter
2021-03-26  0:00     ` [Intel-gfx] " Daniel Vetter
2021-03-26 10:32     ` Tvrtko Ursulin
2021-03-26 10:32       ` [Intel-gfx] " Tvrtko Ursulin
2021-03-24 12:13 ` Tvrtko Ursulin [this message]
2021-03-24 12:13   ` [Intel-gfx] [PATCH 6/7] drm/i915: Fail too long user submissions by default Tvrtko Ursulin
2021-03-24 12:13 ` [PATCH 7/7] drm/i915: Allow configuring default request expiry via modparam Tvrtko Ursulin
2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-03-26  0:25   ` Daniel Vetter
2021-03-26  0:25     ` [Intel-gfx] " Daniel Vetter
2021-03-24 13:16 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Default request/fence expiry + watchdog (rev5) Patchwork
2021-03-24 13:21 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork
2021-03-24 13:48 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2021-03-24 23:29 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2021-03-26  9:10 ` [Intel-gfx] [PATCH v4 0/7] Default request/fence expiry + watchdog Daniel Vetter
2021-03-26  9:10   ` Daniel Vetter
2021-03-26 10:31   ` Tvrtko Ursulin
2021-03-26 10:31     ` Tvrtko Ursulin
2021-04-08 10:18     ` Daniel Vetter
2021-04-08 10:18       ` Daniel Vetter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210324121335.2307063-7-tvrtko.ursulin@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=Intel-gfx@lists.freedesktop.org \
    --cc=daniel.vetter@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=matthew.auld@intel.com \
    --cc=tvrtko.ursulin@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.