From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Intel-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Subject: Re: [Intel-gfx] [RFC 1/6] drm/i915: Individual request cancellation
Date: Mon, 15 Mar 2021 17:37:27 +0000
Message-ID: <f361804a-2c51-77ee-dbb4-0caba6bfffd0@linux.intel.com>
In-Reply-To: <20210312154622.1767865-2-tvrtko.ursulin@linux.intel.com>


On 12/03/2021 15:46, Tvrtko Ursulin wrote:
> From: Chris Wilson <chris@chris-wilson.co.uk>
> 
> Currently, we cancel outstanding requests within a context when the
> context is closed. We may also want to cancel individual requests using
> the same graceful preemption mechanism.
> 
> v2 (Tvrtko):
>   * Cancel waiters carefully considering no timeline lock and RCU.
>   * Fixed selftests.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

[snip]

> +void i915_request_cancel(struct i915_request *rq, int error)
> +{
> +	if (!i915_request_set_error_once(rq, error))
> +		return;
> +
> +	set_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags);
> +
> +	if (i915_sw_fence_signaled(&rq->submit)) {
> +		struct i915_dependency *p;
> +
> +restart:
> +		rcu_read_lock();
> +		for_each_waiter(p, rq) {
> +			struct i915_request *w =
> +				container_of(p->waiter, typeof(*w), sched);
> +
> +			if (__i915_request_is_complete(w) ||
> +			    fatal_error(w->fence.error))
> +				continue;
> +
> +			w = i915_request_get(w);
> +			rcu_read_unlock();
> +			/* Recursion bound by the number of engines */
> +			i915_request_cancel(w, error);
> +			i915_request_put(w);
> +
> +			/* Restart after having to drop rcu lock. */
> +			goto restart;
> +		}

So I need to fix this error propagation to waiters in order to avoid the 
potential stack overflow caught in CI shards (gem_ctx_ringsize).
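
Roughly, as a sketch only -- standalone C with made-up types, not actual 
i915 code -- the recursion could be flattened into an explicit worklist 
so that stack usage stops growing with the depth of the dependency chain:

#include <stddef.h>

/*
 * Illustrative only: every type and name here is hypothetical. The idea
 * is to replace the recursive cancel call with an explicit worklist.
 */
struct node {
	int error;		/* 0 while live, error code once cancelled */
	struct node **waiters;	/* nodes that depend on this one */
	int nr_waiters;
	struct node *next;	/* singly linked worklist */
};

static void cancel_all(struct node *rq, int error)
{
	struct node *head = rq;

	rq->error = error;	/* claim the root before walking */
	rq->next = NULL;

	while (head) {
		struct node *n = head;
		int i;

		head = head->next;

		for (i = 0; i < n->nr_waiters; i++) {
			struct node *w = n->waiters[i];

			if (w->error)
				continue;	/* already cancelled/completed */

			/* Claim before queueing, so each node is queued once. */
			w->error = error;
			w->next = head;
			head = w;
		}
	}
}

Claiming the error field before queueing doubles as the visited mark, so 
diamond-shaped dependency graphs are only walked once.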

Alternatively, we decide not to propagate fence errors at all. I am not 
sure the consequences are particularly better or worse either way. Things 
will break regardless, since which userspace actually inspects fences for 
unexpected errors?!

So rendering corruption, more or less. Whether it can cause a further 
stream of GPU hangs I am not sure; only if there is an inter-engine data 
dependency involving data more complex than images/textures.

Regards,

Tvrtko

> +		rcu_read_unlock();
> +	}
> +
> +	__cancel_request(rq);
> +}

Thread overview: 27+ messages
2021-03-12 15:46 [RFC 0/6] Default request/fence expiry + watchdog Tvrtko Ursulin
2021-03-12 15:46 ` [RFC 1/6] drm/i915: Individual request cancellation Tvrtko Ursulin
2021-03-15 17:37   ` Tvrtko Ursulin [this message]
2021-03-16 10:02     ` Daniel Vetter
2021-03-12 15:46 ` [RFC 2/6] drm/i915: Restrict sentinel requests further Tvrtko Ursulin
2021-03-12 15:46 ` [RFC 3/6] drm/i915: Request watchdog infrastructure Tvrtko Ursulin
2021-03-12 15:46 ` [RFC 4/6] drm/i915: Allow userspace to configure the watchdog Tvrtko Ursulin
2021-03-16 10:09   ` Daniel Vetter
2021-03-12 15:46 ` [RFC 5/6] drm/i915: Fail too long user submissions by default Tvrtko Ursulin
2021-03-16 10:10   ` Daniel Vetter
2021-03-12 15:46 ` [RFC 6/6] drm/i915: Allow configuring default request expiry via modparam Tvrtko Ursulin
2021-03-16 10:03   ` Daniel Vetter
2021-03-12 16:22 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Default request/fence expiry + watchdog Patchwork
2021-03-12 16:48 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2021-03-12 18:25 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
