From: Chris Wilson <chris@chris-wilson.co.uk>
To: intel-gfx@lists.freedesktop.org
Subject: [PATCH 12/16] drm/i915/gem: Cancel contexts when hangchecking is disabled
Date: Mon, 21 Oct 2019 09:02:22 +0100	[thread overview]
Message-ID: <20191021080226.537-12-chris@chris-wilson.co.uk> (raw)
In-Reply-To: <20191021080226.537-1-chris@chris-wilson.co.uk>

Normally, we rely on our hangcheck to prevent persistent batches from
hogging the GPU. However, if the user disables hangcheck, this mechanism
breaks down. Despite our insistence that this is unsafe, the users are
equally insistent that they want to use endless batches and will disable
the hangcheck mechanism. We are looking at replacing hangcheck, in the
next patch, with a softer mechanism that sends a pulse down the engine
to check whether it is still well. We can use the same preemptive pulse
to flush an active context off the GPU upon context close, preventing
resources from being lost and unkillable requests from remaining on the
GPU after process termination.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Jon Bloomfield <jon.bloomfield@intel.com>
Reviewed-by: Jon Bloomfield <jon.bloomfield@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 141 ++++++++++++++++++++
 1 file changed, 141 insertions(+)
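
As a quick reference for review, here is a minimal userspace sketch of
the close-time policy the diff below implements when hangcheck is
disabled. The struct fields are invented stand-ins for the outcomes of
intel_engine_pulse(), intel_engine_reset() and intel_gt_handle_error();
it models only the decision flow, not the driver code.

#include <stdbool.h>
#include <stdio.h>

struct engine_model {
	const char *name;
	bool has_inflight_request;	/* does last_request still have work pending? */
	bool pulse_preempts;		/* would a heartbeat pulse preempt the hog? */
	bool engine_reset_works;	/* is a per-engine reset available and successful? */
};

/* Mirrors the order of attempts in kill_context()/__cancel_engine(). */
static void close_banned_context(const struct engine_model *e)
{
	printf("%s: context banned on close\n", e->name);

	if (!e->has_inflight_request)
		return;	/* nothing left running, nothing to cancel */

	if (e->pulse_preempts)
		printf("%s: preemptive pulse bumped the hog; banned requests are skipped\n",
		       e->name);
	else if (e->engine_reset_works)
		printf("%s: pulse unavailable, resetting just this engine\n", e->name);
	else
		printf("%s: falling back to a full device reset\n", e->name);
}

int main(void)
{
	/* An endless batch left running with hangcheck disabled. */
	const struct engine_model rcs0 = { "rcs0", true, false, true };

	close_banned_context(&rcs0);
	return 0;
}

The trigger remains the i915.enable_hangcheck module parameter checked
in context_close(); with hangcheck left enabled, context close behaves
as before.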

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 7b01f4605f21..b2f042d87be0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -70,6 +70,7 @@
 #include <drm/i915_drm.h>
 
 #include "gt/intel_lrc_reg.h"
+#include "gt/intel_engine_heartbeat.h"
 #include "gt/intel_engine_user.h"
 
 #include "i915_gem_context.h"
@@ -276,6 +277,135 @@ void i915_gem_context_release(struct kref *ref)
 		schedule_work(&gc->free_work);
 }
 
+static inline struct i915_gem_engines *
+__context_engines_static(const struct i915_gem_context *ctx)
+{
+	return rcu_dereference_protected(ctx->engines, true);
+}
+
+static bool __reset_engine(struct intel_engine_cs *engine)
+{
+	struct intel_gt *gt = engine->gt;
+	bool success = false;
+
+	if (!intel_has_reset_engine(gt))
+		return false;
+
+	if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
+			      &gt->reset.flags)) {
+		success = intel_engine_reset(engine, NULL) == 0;
+		clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
+				      &gt->reset.flags);
+	}
+
+	return success;
+}
+
+static void __reset_context(struct i915_gem_context *ctx,
+			    struct intel_engine_cs *engine)
+{
+	intel_gt_handle_error(engine->gt, engine->mask, 0,
+			      "context closure in %s", ctx->name);
+}
+
+static bool __cancel_engine(struct intel_engine_cs *engine)
+{
+	/*
+	 * Send a "high priority pulse" down the engine to cause the
+	 * current request to be momentarily preempted. (If it fails to
+	 * be preempted, it will be reset). As we have marked our context
+	 * as banned, any incomplete request, including any now running, will
+	 * be skipped following the preemption.
+	 *
+	 * If there is no hangchecking (one of the reasons why we try to
+	 * cancel the context) and no forced preemption, there may be no
+	 * means by which we reset the GPU and evict the persistent hog.
+	 * Ergo if we are unable to inject a preemptive pulse that can
+	 * kill the banned context, we fall back to doing a local reset
+	 * instead.
+	 */
+	if (CONFIG_DRM_I915_PREEMPT_TIMEOUT && !intel_engine_pulse(engine))
+		return true;
+
+	/* If we are unable to send a pulse, try resetting this engine. */
+	return __reset_engine(engine);
+}
+
+static struct intel_engine_cs *
+active_engine(struct dma_fence *fence, struct intel_context *ce)
+{
+	struct i915_request *rq = to_request(fence);
+	struct intel_engine_cs *engine, *locked;
+
+	/*
+	 * Serialise with __i915_request_submit() so that it either sees
+	 * our banned status, or we know the request is already inflight.
+	 */
+	locked = READ_ONCE(rq->engine);
+	spin_lock_irq(&locked->active.lock);
+	while (unlikely(locked != (engine = READ_ONCE(rq->engine)))) {
+		spin_unlock(&locked->active.lock);
+		spin_lock(&engine->active.lock);
+		locked = engine;
+	}
+
+	engine = NULL;
+	if (i915_request_is_active(rq) && !rq->fence.error)
+		engine = rq->engine;
+
+	spin_unlock_irq(&locked->active.lock);
+
+	return engine;
+}
+
+static void kill_context(struct i915_gem_context *ctx)
+{
+	struct i915_gem_engines_iter it;
+	struct intel_context *ce;
+
+	/*
+	 * If we are already banned, it was due to a guilty request causing
+	 * a reset and the entire context being evicted from the GPU.
+	 */
+	if (i915_gem_context_is_banned(ctx))
+		return;
+
+	i915_gem_context_set_banned(ctx);
+
+	/*
+	 * Map the user's engine back to the actual engines; one virtual
+	 * engine will be mapped to multiple engines, and using ctx->engine[]
+	 * the same engine may have multiple instances in the user's map.
+	 * However, we only care about pending requests, so only include
+	 * engines on which there are incomplete requests.
+	 */
+	for_each_gem_engine(ce, __context_engines_static(ctx), it) {
+		struct intel_engine_cs *engine;
+		struct dma_fence *fence;
+
+		if (!ce->timeline)
+			continue;
+
+		fence = i915_active_fence_get(&ce->timeline->last_request);
+		if (!fence)
+			continue;
+
+		/* Check with the backend if the request is still inflight */
+		engine = active_engine(fence, ce);
+
+		/* First attempt to gracefully cancel the context */
+		if (engine && !__cancel_engine(engine))
+			/*
+			 * If we are unable to send a preemptive pulse to bump
+			 * the context from the GPU, we have to resort to a full
+			 * reset. We hope the collateral damage is worth it.
+			 */
+			__reset_context(ctx, engine);
+
+		dma_fence_put(fence);
+	}
+}
+
 static void context_close(struct i915_gem_context *ctx)
 {
 	struct i915_address_space *vm;
@@ -298,6 +428,17 @@ static void context_close(struct i915_gem_context *ctx)
 	lut_close(ctx);
 
 	mutex_unlock(&ctx->mutex);
+
+	/*
+	 * If the user has disabled hangchecking, we cannot be sure that
+	 * the batches will ever complete after the context is closed,
+	 * keeping the context and all resources pinned forever. So in this
+	 * case we opt to forcibly kill off all remaining requests on
+	 * context close.
+	 */
+	if (!i915_modparams.enable_hangcheck)
+		kill_context(ctx);
+
 	i915_gem_context_put(ctx);
 }
 
-- 
2.24.0.rc0
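
As an aside on active_engine() above: it takes the lock of whichever
engine the request currently points at and then re-checks, chasing the
request if it migrated (e.g. behind a virtual engine) before the lock
was acquired. The sketch below models that idiom in userspace, with
pthreads and C11 atomics standing in for engine->active.lock and
rq->engine; all names in it are invented for illustration.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct engine {
	pthread_mutex_t lock;
	const char *name;
};

struct request {
	_Atomic(struct engine *) engine;	/* may be flipped concurrently */
};

static struct engine *lock_request_engine(struct request *rq)
{
	struct engine *locked = atomic_load(&rq->engine);
	struct engine *now;

	pthread_mutex_lock(&locked->lock);
	while ((now = atomic_load(&rq->engine)) != locked) {
		/* The request moved before we took the lock: chase it. */
		pthread_mutex_unlock(&locked->lock);
		pthread_mutex_lock(&now->lock);
		locked = now;
	}
	return locked;	/* caller drops locked->lock when done */
}

int main(void)
{
	struct engine e0 = { PTHREAD_MUTEX_INITIALIZER, "vcs0" };
	struct request rq;
	struct engine *locked;

	atomic_init(&rq.engine, &e0);

	locked = lock_request_engine(&rq);
	printf("request is stable on %s while we hold its lock\n", locked->name);
	pthread_mutex_unlock(&locked->lock);
	return 0;
}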
