From: Daniel Vetter <daniel@ffwll.ch>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>,
	Intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Subject: Re: [RFC 5/6] drm/i915: Fail too long user submissions by default
Date: Tue, 16 Mar 2021 11:10:54 +0100
Message-ID: <YFCELgxgfy70w68A@phenom.ffwll.local>
In-Reply-To: <20210312154622.1767865-6-tvrtko.ursulin@linux.intel.com>

On Fri, Mar 12, 2021 at 03:46:21PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> A new Kconfig option CONFIG_DRM_I915_REQUEST_TIMEOUT is added, defaulting
> to 10s, and this timeout is applied to _all_ contexts using the previously
> added watchdog facility.
> 
> Result of this is that any user submission will simply fail after this
> time, either causing a reset (for non-preemptable) or incomplete results.
> 
> This can have an effect that workloads which used to work fine will
> suddenly start failing.
> 
> When the default expiry is active userspace will not be allowed to
> disable the timeout or increase it beyond the default using the context
> param setting.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>

I think this should explain that it will break long running compute
workloads, and that maybe the modparam in the next patch can paper over
that until we've implemented proper long running compute workload support
in upstream. Which is unfortunately still some ways off.

Otherwise this all makes sense to me. If you want, maybe also copy some of
the discussion from your cover letter into this commit message; I think
there's some good stuff there.
-Daniel

> ---
>  drivers/gpu/drm/i915/Kconfig.profile        |  8 ++++
>  drivers/gpu/drm/i915/gem/i915_gem_context.c | 47 ++++++++++++++++++---
>  2 files changed, 48 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> index 35bbe2b80596..55e157ffff73 100644
> --- a/drivers/gpu/drm/i915/Kconfig.profile
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -1,3 +1,11 @@
> +config DRM_I915_REQUEST_TIMEOUT
> +	int "Default timeout for requests (ms)"
> +	default 10000 # milliseconds
> +	help
> +	  ...
> +
> +	  May be 0 to disable the timeout.
> +
>  config DRM_I915_FENCE_TIMEOUT
>  	int "Timeout for unsignaled foreign fences (ms, jiffy granularity)"
>  	default 10000 # milliseconds
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index 32b05af4fc8f..21c0176e27a0 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -854,6 +854,25 @@ static void __assign_timeline(struct i915_gem_context *ctx,
>  	context_apply_all(ctx, __apply_timeline, timeline);
>  }
>  
> +static int
> +__set_watchdog(struct i915_gem_context *ctx, unsigned long timeout_us);
> +
> +static void __set_default_fence_expiry(struct i915_gem_context *ctx)
> +{
> +	struct drm_i915_private *i915 = ctx->i915;
> +	int ret;
> +
> +	if (!IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT))
> +		return;
> +
> +	/* Default expiry for user fences. */
> +	ret = __set_watchdog(ctx, CONFIG_DRM_I915_REQUEST_TIMEOUT * 1000);
> +	if (ret)
> +		drm_notice(&i915->drm,
> +			   "Failed to configure default fence expiry! (%d)",
> +			   ret);
> +}
> +
>  static struct i915_gem_context *
>  i915_gem_create_context(struct drm_i915_private *i915, unsigned int flags)
>  {
> @@ -898,6 +917,8 @@ i915_gem_create_context(struct drm_i915_private *i915, unsigned int flags)
>  		intel_timeline_put(timeline);
>  	}
>  
> +	__set_default_fence_expiry(ctx);
> +
>  	trace_i915_context_create(ctx);
>  
>  	return ctx;
> @@ -1404,23 +1425,35 @@ static int __apply_watchdog(struct intel_context *ce, void *timeout_us)
>  	return intel_context_set_watchdog_us(ce, (uintptr_t)timeout_us);
>  }
>  
> -static int set_watchdog(struct i915_gem_context *ctx,
> -			struct drm_i915_gem_context_param *args)
> +static int
> +__set_watchdog(struct i915_gem_context *ctx, unsigned long timeout_us)
>  {
>  	int ret;
>  
> -	if (args->size)
> -		return -EINVAL;
> -
>  	ret = context_apply_all(ctx, __apply_watchdog,
> -				(void *)(uintptr_t)args->value);
> +				(void *)(uintptr_t)timeout_us);
>  
>  	if (!ret)
> -		ctx->watchdog.timeout_us = args->value;
> +		ctx->watchdog.timeout_us = timeout_us;
>  
>  	return ret;
>  }
>  
> +static int set_watchdog(struct i915_gem_context *ctx,
> +			struct drm_i915_gem_context_param *args)
> +{
> +	if (args->size)
> +		return -EINVAL;
> +
> +	/* Disallow disabling or configuring longer watchdog than default. */
> +	if (IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT) &&
> +	    (!args->value ||
> +	     args->value > CONFIG_DRM_I915_REQUEST_TIMEOUT * 1000))
> +		return -EPERM;
> +
> +	return __set_watchdog(ctx, args->value);
> +}
> +
>  static int __get_ringsize(struct intel_context *ce, void *arg)
>  {
>  	long sz;
> -- 
> 2.27.0
> 
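
For reference, the acceptance rule in set_watchdog() above can be modelled
in plain userspace C. This is only a sketch under stated assumptions:
check_watchdog_request() and watchdog_check_selftest() are hypothetical
names that are not part of the patch, and the default is passed in as a
parameter instead of coming from CONFIG_DRM_I915_REQUEST_TIMEOUT. It also
assumes IS_ACTIVE() simply means "non-zero".

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the check in set_watchdog().
 *
 * default_ms   - stand-in for CONFIG_DRM_I915_REQUEST_TIMEOUT (milliseconds)
 * requested_us - value userspace asks for (microseconds)
 *
 * Returns 0 if the request would be accepted, -EPERM if it would be refused.
 */
static int check_watchdog_request(uint64_t default_ms, uint64_t requested_us)
{
	/* A default of 0 means the default expiry is disabled: no restriction. */
	if (!default_ms)
		return 0;

	/* Refuse disabling (0) or anything longer than the default (ms -> us). */
	if (!requested_us || requested_us > default_ms * 1000)
		return -EPERM;

	return 0;
}

/* Small self-check exercising the boundaries of the rule. */
void watchdog_check_selftest(void)
{
	assert(check_watchdog_request(10000, 5000000) == 0);       /* shorter: ok */
	assert(check_watchdog_request(10000, 10000000) == 0);      /* equal: ok */
	assert(check_watchdog_request(10000, 10000001) == -EPERM); /* longer */
	assert(check_watchdog_request(10000, 0) == -EPERM);        /* disable */
	assert(check_watchdog_request(0, 0) == 0);                 /* no default */
}
```

The point of the sketch is that with a 10s default, userspace can only
shorten the watchdog; both "off" and "longer" come back as -EPERM.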

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
