* [PATCH 1/5] drm/i915: Only enqueue already completed requests
@ 2019-08-06 13:47 Chris Wilson
  2019-08-06 13:47 ` [PATCH 2/5] drm/i915/execlists: Force preemption Chris Wilson
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Chris Wilson @ 2019-08-06 13:47 UTC (permalink / raw)
  To: intel-gfx

If we are asked to submit a completed request, just move it onto the
active-list without modifying its payload.
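The reordering can be seen in miniature with a userspace sketch (the names
here are illustrative stand-ins, not the kernel's): a completed request takes
the transfer path directly, skipping the breadcrumb emission and the engine
serial bump that only make sense for a payload that still has to execute.

```c
#include <stdbool.h>

/* Toy stand-ins for i915_request / intel_engine_cs */
struct mini_request {
	bool completed;
	bool breadcrumb_emitted;
	bool on_active_list;
};

struct mini_engine {
	unsigned int serial;
};

static void mini_submit(struct mini_engine *engine, struct mini_request *rq)
{
	if (!rq->completed) {
		/* payload still pending: emit the final seqno write */
		rq->breadcrumb_emitted = true;
		engine->serial++;
	}

	/* both paths converge: the request joins the active list */
	rq->on_active_list = true;
}
```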

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_request.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 8ac7d14ec8c9..69fc66bd1125 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -397,6 +397,9 @@ void __i915_request_submit(struct i915_request *request)
 	GEM_BUG_ON(!irqs_disabled());
 	lockdep_assert_held(&engine->active.lock);
 
+	if (i915_request_completed(request))
+		goto xfer;
+
 	if (i915_gem_context_is_banned(request->gem_context))
 		i915_request_skip(request, -EIO);
 
@@ -420,7 +423,13 @@ void __i915_request_submit(struct i915_request *request)
 	    i915_sw_fence_signaled(&request->semaphore))
 		engine->saturated |= request->sched.semaphores;
 
+	engine->emit_fini_breadcrumb(request,
+				     request->ring->vaddr + request->postfix);
+
+	engine->serial++;
+
 	/* We may be recursing from the signal callback of another i915 fence */
+xfer:
 	spin_lock_nested(&request->lock, SINGLE_DEPTH_NESTING);
 
 	list_move_tail(&request->sched.link, &engine->active.requests);
@@ -437,11 +446,6 @@ void __i915_request_submit(struct i915_request *request)
 
 	spin_unlock(&request->lock);
 
-	engine->emit_fini_breadcrumb(request,
-				     request->ring->vaddr + request->postfix);
-
-	engine->serial++;
-
 	trace_i915_request_execute(request);
 }
 
-- 
2.23.0.rc1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/5] drm/i915/execlists: Force preemption
  2019-08-06 13:47 [PATCH 1/5] drm/i915: Only enqueue already completed requests Chris Wilson
@ 2019-08-06 13:47 ` Chris Wilson
  2019-08-06 13:47 ` [PATCH 3/5] drm/i915: Mark up "sentinel" requests Chris Wilson
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Chris Wilson @ 2019-08-06 13:47 UTC (permalink / raw)
  To: intel-gfx

If the preempted context takes too long to relinquish control, e.g. it
is stuck inside a shader with arbitration disabled, evict that context
with an engine reset. This ensures that preemptions are reasonably
responsive, providing a tighter QoS for the more important context at
the cost of flagging unresponsive contexts more frequently (i.e. instead
of using an ~10s hangcheck, we now evict at ~100ms). The challenge
lies in picking a timeout that can be reasonably serviced by the HW for
typical workloads, balancing the existing clients against the need for
responsiveness.

Note that coupled with timeslicing, this will lead to rapid GPU "hang"
detection with multiple active contexts vying for GPU time.
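A rough userspace model of the timeout bookkeeping, with a fake tick counter
standing in for jiffies and helper names of our own choosing: a deadline is
armed on submission, and a deadline that has expired while the preemption is
still outstanding is grounds for the engine reset.

```c
#include <stdbool.h>

#define PREEMPT_TIMEOUT_TICKS 100 /* cf. CONFIG_DRM_I915_PREEMPT_TIMEOUT */

static unsigned long fake_jiffies; /* stand-in for the volatile jiffies */

static unsigned long preempt_deadline(void)
{
	/* snapshot the clock once, as preempt_expires() is careful to do */
	return fake_jiffies + PREEMPT_TIMEOUT_TICKS;
}

static bool preempt_timed_out(unsigned long deadline)
{
	/* time_after() semantics: wrap-safe via signed difference */
	return (long)(fake_jiffies - deadline) >= 0;
}
```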

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/Kconfig.profile | 12 +++++
 drivers/gpu/drm/i915/gt/intel_lrc.c  | 65 ++++++++++++++++++++++++++--
 drivers/gpu/drm/i915/i915_params.h   |  2 +-
 3 files changed, 75 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index 48df8889a88a..3184e8491333 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -25,3 +25,15 @@ config DRM_I915_SPIN_REQUEST
 	  May be 0 to disable the initial spin. In practice, we estimate
 	  the cost of enabling the interrupt (if currently disabled) to be
 	  a few microseconds.
+
+config DRM_I915_PREEMPT_TIMEOUT
+	int "Preempt timeout (ms)"
+	default 100 # milliseconds
+	help
+	  How long to wait (in milliseconds) for a preemption event to occur
+	  when submitting a new context via execlists. If the current context
+	  does not hit an arbitration point and yield to HW before the timer
+	  expires, the HW will be reset to allow the more important context
+	  to execute.
+
+	  May be 0 to disable the timeout.
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 232f40fcb490..4e45cd972267 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -948,6 +948,21 @@ static void record_preemption(struct intel_engine_execlists *execlists)
 	(void)I915_SELFTEST_ONLY(execlists->preempt_hang.count++);
 }
 
+static unsigned long preempt_expires(void)
+{
+	const unsigned long timeout =
+		msecs_to_jiffies_timeout(CONFIG_DRM_I915_PREEMPT_TIMEOUT);
+
+	/*
+	 * Paranoia to make sure the compiler computes the timeout before
+	 * loading 'jiffies' as jiffies is volatile and may be updated in
+	 * the background by a timer tick. All to reduce the complexity
+	 * of the addition and reduce the risk of losing a jiffie.
+	 */
+	barrier();
+	return jiffies + timeout;
+}
+
 static void execlists_dequeue(struct intel_engine_cs *engine)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
@@ -1286,6 +1301,8 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 		*port = execlists_schedule_in(last, port - execlists->pending);
 		memset(port + 1, 0, (last_port - port) * sizeof(*port));
 		execlists_submit_ports(engine);
+		if (CONFIG_DRM_I915_PREEMPT_TIMEOUT)
+			mod_timer(&execlists->timer, preempt_expires());
 	} else {
 		ring_set_paused(engine, 0);
 	}
@@ -1470,13 +1487,48 @@ static void process_csb(struct intel_engine_cs *engine)
 	invalidate_csb_entries(&buf[0], &buf[num_entries - 1]);
 }
 
-static void __execlists_submission_tasklet(struct intel_engine_cs *const engine)
+static bool __execlists_submission_tasklet(struct intel_engine_cs *const engine)
 {
 	lockdep_assert_held(&engine->active.lock);
 
 	process_csb(engine);
-	if (!engine->execlists.pending[0])
+	if (!engine->execlists.pending[0]) {
 		execlists_dequeue(engine);
+		return true;
+	}
+
+	return false;
+}
+
+static noinline void preempt_reset(struct intel_engine_cs *engine)
+{
+	const unsigned int bit = I915_RESET_ENGINE + engine->id;
+	unsigned long *lock = &engine->gt->reset.flags;
+
+	if (i915_modparams.reset < 3)
+		return;
+
+	if (test_and_set_bit(bit, lock))
+		return;
+
+	/* Mark this tasklet as disabled to avoid waiting for it to complete */
+	tasklet_disable_nosync(&engine->execlists.tasklet);
+
+	intel_engine_reset(engine, "preemption time out");
+
+	tasklet_enable(&engine->execlists.tasklet);
+	clear_and_wake_up_bit(bit, lock);
+}
+
+static bool preempt_timeout(struct intel_engine_cs *const engine)
+{
+	if (!CONFIG_DRM_I915_PREEMPT_TIMEOUT)
+		return false;
+
+	if (!intel_engine_has_preemption(engine))
+		return false;
+
+	return !timer_pending(&engine->execlists.timer);
 }
 
 /*
@@ -1487,10 +1539,17 @@ static void execlists_submission_tasklet(unsigned long data)
 {
 	struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
 	unsigned long flags;
+	bool reset = false;
 
 	spin_lock_irqsave(&engine->active.lock, flags);
-	__execlists_submission_tasklet(engine);
+
+	if (!__execlists_submission_tasklet(engine) && preempt_timeout(engine))
+		reset = true;
+
 	spin_unlock_irqrestore(&engine->active.lock, flags);
+
+	if (unlikely(reset))
+		preempt_reset(engine);
 }
 
 static void execlists_submission_timer(struct timer_list *timer)
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index d29ade3b7de6..56058978bb27 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -61,7 +61,7 @@ struct drm_printer;
 	param(char *, dmc_firmware_path, NULL) \
 	param(int, mmio_debug, -IS_ENABLED(CONFIG_DRM_I915_DEBUG_MMIO)) \
 	param(int, edp_vswing, 0) \
-	param(int, reset, 2) \
+	param(int, reset, 3) \
 	param(unsigned int, inject_load_failure, 0) \
 	param(int, fastboot, -1) \
 	param(int, enable_dpcd_backlight, 0) \
-- 
2.23.0.rc1



* [PATCH 3/5] drm/i915: Mark up "sentinel" requests
  2019-08-06 13:47 [PATCH 1/5] drm/i915: Only enqueue already completed requests Chris Wilson
  2019-08-06 13:47 ` [PATCH 2/5] drm/i915/execlists: Force preemption Chris Wilson
@ 2019-08-06 13:47 ` Chris Wilson
  2019-08-06 14:29   ` Mika Kuoppala
  2019-08-06 13:47 ` [PATCH 4/5] drm/i915/execlists: Cancel banned contexts on schedule-out Chris Wilson
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Chris Wilson @ 2019-08-06 13:47 UTC (permalink / raw)
  To: intel-gfx

Sometimes we want to emit a terminator request, a request that flushes
the pipeline and allows no request to come after it. This can be used
for a "preempt-to-idle" to ensure that upon processing the
context-switch to that request, all other active contexts have been
flushed.
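A toy model of the sentinel rule (simplified structs, with flag values
mirroring the patch): requests from the same context normally coalesce into
one port submission, but nothing may be queued behind a request carrying the
sentinel flag.

```c
#include <stdbool.h>

/* flag bits mirroring I915_REQUEST_* in the patch */
#define MREQ_WAITBOOST (1u << 0)
#define MREQ_NOPREEMPT (1u << 1)
#define MREQ_SENTINEL  (1u << 2)

struct mreq {
	unsigned long flags;
	int ctx_id; /* stand-in for hw_context identity */
};

static bool mreq_has_sentinel(const struct mreq *rq)
{
	return rq->flags & MREQ_SENTINEL;
}

static bool can_merge(const struct mreq *prev, const struct mreq *next)
{
	if (mreq_has_sentinel(prev))
		return false; /* nothing rides behind a sentinel */

	return prev->ctx_id == next->ctx_id;
}
```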

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c |  6 ++++++
 drivers/gpu/drm/i915/i915_request.h | 10 ++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 4e45cd972267..59a7e4eb7e2a 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -759,6 +759,9 @@ static bool can_merge_rq(const struct i915_request *prev,
 	GEM_BUG_ON(prev == next);
 	GEM_BUG_ON(!assert_priority_queue(prev, next));
 
+	if (i915_request_has_sentinel(prev))
+		return false;
+
 	if (!can_merge_ctx(prev->hw_context, next->hw_context))
 		return false;
 
@@ -1250,6 +1253,9 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 				if (last->hw_context == rq->hw_context)
 					goto done;
 
+				if (i915_request_has_sentinel(last))
+					goto done;
+
 				/*
 				 * If GVT overrides us we only ever submit
 				 * port[0], leaving port[1] empty. Note that we
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 313df3c37158..c41ecbe0bd0c 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -217,8 +217,9 @@ struct i915_request {
 	unsigned long emitted_jiffies;
 
 	unsigned long flags;
-#define I915_REQUEST_WAITBOOST BIT(0)
-#define I915_REQUEST_NOPREEMPT BIT(1)
+#define I915_REQUEST_WAITBOOST	BIT(0)
+#define I915_REQUEST_NOPREEMPT	BIT(1)
+#define I915_REQUEST_SENTINEL	BIT(2)
 
 	/** timeline->request entry for this request */
 	struct list_head link;
@@ -443,6 +444,11 @@ static inline bool i915_request_has_nopreempt(const struct i915_request *rq)
 	return unlikely(rq->flags & I915_REQUEST_NOPREEMPT);
 }
 
+static inline bool i915_request_has_sentinel(const struct i915_request *rq)
+{
+	return unlikely(rq->flags & I915_REQUEST_SENTINEL);
+}
+
 bool i915_retire_requests(struct drm_i915_private *i915);
 
 #endif /* I915_REQUEST_H */
-- 
2.23.0.rc1



* [PATCH 4/5] drm/i915/execlists: Cancel banned contexts on schedule-out
  2019-08-06 13:47 [PATCH 1/5] drm/i915: Only enqueue already completed requests Chris Wilson
  2019-08-06 13:47 ` [PATCH 2/5] drm/i915/execlists: Force preemption Chris Wilson
  2019-08-06 13:47 ` [PATCH 3/5] drm/i915: Mark up "sentinel" requests Chris Wilson
@ 2019-08-06 13:47 ` Chris Wilson
  2019-08-06 13:47 ` [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close Chris Wilson
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Chris Wilson @ 2019-08-06 13:47 UTC (permalink / raw)
  To: intel-gfx

On completion of a banned context, scrub the context image so that we do
not replay the active payload. The intent is that we skip banned
payloads on request submission so that timeline advancement
continues in the background. However, if we are returning to a
preempted request, i915_request_skip() is ineffective and instead we
need to patch up the context image so that it continues from the start
of the next request.
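The head rewind in the cancel path relies on ring sizes being powers of two,
so the wrap reduces to a mask, much as in the kernel's intel_ring_wrap(); a
sketch under that assumption:

```c
#include <stdint.h>

/*
 * Rewind helper: fold an unwrapped position (e.g. the cancelled
 * request's wa_tail) back into the ring. Assumes ring_size is a
 * power of two, which i915 rings are.
 */
static uint32_t ring_wrap(uint32_t ring_size, uint32_t pos)
{
	return pos & (ring_size - 1);
}
```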

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 39 +++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 59a7e4eb7e2a..3f166e799398 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -222,6 +222,9 @@ static void execlists_init_reg_state(u32 *reg_state,
 				     struct intel_context *ce,
 				     struct intel_engine_cs *engine,
 				     struct intel_ring *ring);
+static void
+__execlists_update_reg_state(struct intel_context *ce,
+			     struct intel_engine_cs *engine);
 
 static inline u32 intel_hws_preempt_address(struct intel_engine_cs *engine)
 {
@@ -575,6 +578,39 @@ static void kick_siblings(struct i915_request *rq, struct intel_context *ce)
 		tasklet_schedule(&ve->base.execlists.tasklet);
 }
 
+static void cancel_active(struct intel_context *ce, struct i915_request *rq)
+{
+	struct intel_engine_cs *engine = ce->inflight;
+	u32 *regs = ce->lrc_reg_state;
+
+	if (i915_request_completed(rq))
+		return;
+
+	GEM_TRACE("%s(%s): { rq=%llx:%lld }\n",
+		  __func__, engine->name, rq->fence.context, rq->fence.seqno);
+
+	/* Scrub the context image to prevent replaying the previous batch */
+	if (engine->pinned_default_state) {
+		memcpy(regs, /* skip restoring the vanilla PPHWSP */
+		       engine->pinned_default_state + LRC_STATE_PN * PAGE_SIZE,
+		       engine->context_size - PAGE_SIZE);
+	}
+	execlists_init_reg_state(regs, ce, engine, ce->ring);
+
+	/* Ring will be advanced on retire; here we need to reset the context */
+	ce->ring->head = intel_ring_wrap(ce->ring, rq->wa_tail);
+	__execlists_update_reg_state(ce, engine);
+
+	/* We've switched away, so this should be a no-op, but intent matters */
+	ce->lrc_desc |= CTX_DESC_FORCE_RESTORE;
+
+	/* Let everyone know that the request may now be retired */
+	rq->fence.error = -EIO;
+	*(u32 *)ce->ring->timeline->hwsp_seqno = rq->fence.seqno;
+	GEM_BUG_ON(!i915_request_completed(rq));
+	intel_engine_queue_breadcrumbs(engine);
+}
+
 static inline void
 execlists_schedule_out(struct i915_request *rq)
 {
@@ -589,6 +625,9 @@ execlists_schedule_out(struct i915_request *rq)
 		intel_engine_context_out(ce->inflight);
 		execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_OUT);
 
+		if (unlikely(i915_gem_context_is_banned(ce->gem_context)))
+			cancel_active(ce, rq);
+
 		/*
 		 * If this is part of a virtual engine, its next request may
 		 * have been blocked waiting for access to the active context.
-- 
2.23.0.rc1



* [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-06 13:47 [PATCH 1/5] drm/i915: Only enqueue already completed requests Chris Wilson
                   ` (2 preceding siblings ...)
  2019-08-06 13:47 ` [PATCH 4/5] drm/i915/execlists: Cancel banned contexts on schedule-out Chris Wilson
@ 2019-08-06 13:47 ` Chris Wilson
  2019-08-06 14:11   ` Chris Wilson
                     ` (3 more replies)
  2019-08-06 14:25 ` [PATCH 1/5] drm/i915: Only enqueue already completed requests Mika Kuoppala
  2019-08-06 14:51 ` ✗ Fi.CI.BAT: failure for series starting with [1/5] " Patchwork
  5 siblings, 4 replies; 25+ messages in thread
From: Chris Wilson @ 2019-08-06 13:47 UTC (permalink / raw)
  To: intel-gfx

Normally, we rely on our hangcheck to prevent persistent batches from
hogging the GPU. However, if the user disables hangcheck, this mechanism
breaks down. Despite our insistence that this is unsafe, the users are
equally insistent that they want to use endless batches and will disable
the hangcheck mechanism. We are looking at perhaps replacing hangcheck
with a softer mechanism that sends a pulse down the engine to check if
it is well. We can use the same preemptive pulse to flush an active
persistent context off the GPU upon context close, preventing resources
being lost and unkillable requests remaining on the GPU after process
termination. To avoid changing the ABI and accidentally breaking
existing userspace, we make the persistence of a context explicit and
enable it by default. Userspace can opt out of persistent mode (forcing
requests to be cancelled when the context is closed by process
termination or explicitly) by a context parameter, or to facilitate
existing use-cases by disabling hangcheck (i915.enable_hangcheck=0).
(Note, one of the outcomes for supporting endless mode will be the
removal of hangchecking, at which point opting into persistent mode will
be mandatory, or maybe the default.)
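The parameter plumbing can be modelled in plain C (our own names, a stand-in
flag bit, and userspace errno values in place of the kernel's). This mirrors
the logic as posted, where clearing persistence is unconditional while
re-enabling it is gated on preemption support:

```c
#include <errno.h>
#include <stdbool.h>

#define UCTX_PERSISTENCE (1ul << 4) /* cf. UCONTEXT_PERSISTENCE */

struct mini_ctx {
	unsigned long user_flags;
	bool has_preemption; /* stand-in for the scheduler capability */
};

static int set_persistence(struct mini_ctx *ctx, unsigned int size,
			   unsigned long value)
{
	if (size) /* the param carries no payload beyond value */
		return -EINVAL;

	if (!value) {
		ctx->user_flags &= ~UCTX_PERSISTENCE;
		return 0;
	}

	if (!ctx->has_preemption)
		return -ENODEV;

	ctx->user_flags |= UCTX_PERSISTENCE;
	return 0;
}
```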

Testcase: igt/gem_ctx_persistence
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Jon Bloomfield <jon.bloomfield@intel.com>

---
Same sort of caveats as for hangcheck, a few corner cases need
struct_mutex and some preallocation.
---
 drivers/gpu/drm/i915/Makefile                 |  3 +-
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 79 +++++++++++++++++++
 drivers/gpu/drm/i915/gem/i915_gem_context.h   | 15 ++++
 .../gpu/drm/i915/gem/i915_gem_context_types.h |  1 +
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  | 53 +++++++++++++
 .../gpu/drm/i915/gt/intel_engine_heartbeat.h  | 14 ++++
 include/uapi/drm/i915_drm.h                   | 15 ++++
 7 files changed, 179 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
 create mode 100644 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index a1016858d014..42417d87496e 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -71,9 +71,10 @@ obj-y += gt/
 gt-y += \
 	gt/intel_breadcrumbs.o \
 	gt/intel_context.o \
-	gt/intel_engine_pool.o \
 	gt/intel_engine_cs.o \
+	gt/intel_engine_heartbeat.o \
 	gt/intel_engine_pm.o \
+	gt/intel_engine_pool.o \
 	gt/intel_gt.o \
 	gt/intel_gt_pm.o \
 	gt/intel_hangcheck.o \
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 64f7a533e886..5718b74f95b7 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -70,6 +70,7 @@
 #include <drm/i915_drm.h>
 
 #include "gt/intel_lrc_reg.h"
+#include "gt/intel_engine_heartbeat.h"
 
 #include "i915_gem_context.h"
 #include "i915_globals.h"
@@ -373,6 +374,45 @@ void i915_gem_context_release(struct kref *ref)
 		queue_work(i915->wq, &i915->contexts.free_work);
 }
 
+static void kill_context(struct i915_gem_context *ctx)
+{
+	struct i915_gem_engines_iter it;
+	struct intel_engine_cs *engine;
+	intel_engine_mask_t tmp, active;
+	struct intel_context *ce;
+
+	if (i915_gem_context_is_banned(ctx))
+		return;
+
+	i915_gem_context_set_banned(ctx);
+
+	active = 0;
+	for_each_gem_engine(ce, i915_gem_context_lock_engines(ctx), it) {
+		struct i915_request *rq;
+
+		if (!ce->ring)
+			continue;
+
+		rq = i915_active_request_get_unlocked(&ce->ring->timeline->last_request);
+		if (!rq)
+			continue;
+
+		active |= rq->engine->mask;
+		i915_request_put(rq);
+	}
+	i915_gem_context_unlock_engines(ctx);
+
+	/*
+	 * Send a "high priority pulse" down the engine to cause the
+	 * current request to be momentarily preempted. (If it fails to
+	 * be preempted, it will be reset). As we have marked our context
+	 * as banned, any incomplete request, including any running, will
+	 * be skipped.
+	 */
+	for_each_engine_masked(engine, ctx->i915, active, tmp)
+		intel_engine_pulse(engine);
+}
+
 static void context_close(struct i915_gem_context *ctx)
 {
 	mutex_lock(&ctx->mutex);
@@ -394,6 +434,15 @@ static void context_close(struct i915_gem_context *ctx)
 	lut_close(ctx);
 
 	mutex_unlock(&ctx->mutex);
+
+	/*
+	 * If the user has disabled hangchecking, we can not be sure that
+	 * the batches will ever complete and let the context be freed.
+	 * Forcibly kill off any remaining requests in this case.
+	 */
+	if (!i915_gem_context_is_persistent(ctx))
+		kill_context(ctx);
+
 	i915_gem_context_put(ctx);
 }
 
@@ -433,6 +482,8 @@ __create_context(struct drm_i915_private *i915)
 
 	i915_gem_context_set_bannable(ctx);
 	i915_gem_context_set_recoverable(ctx);
+	if (i915_modparams.enable_hangcheck)
+		i915_gem_context_set_persistence(ctx);
 
 	ctx->ring_size = 4 * PAGE_SIZE;
 
@@ -1745,6 +1796,25 @@ get_engines(struct i915_gem_context *ctx,
 	return err;
 }
 
+static int
+set_persistence(struct i915_gem_context *ctx,
+		const struct drm_i915_gem_context_param *args)
+{
+	if (args->size)
+		return -EINVAL;
+
+	if (!args->value) {
+		i915_gem_context_clear_persistence(ctx);
+		return 0;
+	}
+
+	if (!(ctx->i915->caps.scheduler & I915_SCHEDULER_CAP_PREEMPTION))
+		return -ENODEV;
+
+	i915_gem_context_set_persistence(ctx);
+	return 0;
+}
+
 static int ctx_setparam(struct drm_i915_file_private *fpriv,
 			struct i915_gem_context *ctx,
 			struct drm_i915_gem_context_param *args)
@@ -1822,6 +1892,10 @@ static int ctx_setparam(struct drm_i915_file_private *fpriv,
 		ret = set_engines(ctx, args);
 		break;
 
+	case I915_CONTEXT_PARAM_PERSISTENCE:
+		ret = set_persistence(ctx, args);
+		break;
+
 	case I915_CONTEXT_PARAM_BAN_PERIOD:
 	default:
 		ret = -EINVAL;
@@ -2278,6 +2352,11 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 		ret = get_engines(ctx, args);
 		break;
 
+	case I915_CONTEXT_PARAM_PERSISTENCE:
+		args->size = 0;
+		args->value = i915_gem_context_is_persistent(ctx);
+		break;
+
 	case I915_CONTEXT_PARAM_BAN_PERIOD:
 	default:
 		ret = -EINVAL;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.h b/drivers/gpu/drm/i915/gem/i915_gem_context.h
index 106e2ccf7a4c..1834d1df2ac9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.h
@@ -74,6 +74,21 @@ static inline void i915_gem_context_clear_recoverable(struct i915_gem_context *c
 	clear_bit(UCONTEXT_RECOVERABLE, &ctx->user_flags);
 }
 
+static inline bool i915_gem_context_is_persistent(const struct i915_gem_context *ctx)
+{
+	return test_bit(UCONTEXT_PERSISTENCE, &ctx->user_flags);
+}
+
+static inline void i915_gem_context_set_persistence(struct i915_gem_context *ctx)
+{
+	set_bit(UCONTEXT_PERSISTENCE, &ctx->user_flags);
+}
+
+static inline void i915_gem_context_clear_persistence(struct i915_gem_context *ctx)
+{
+	clear_bit(UCONTEXT_PERSISTENCE, &ctx->user_flags);
+}
+
 static inline bool i915_gem_context_is_banned(const struct i915_gem_context *ctx)
 {
 	return test_bit(CONTEXT_BANNED, &ctx->flags);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index a02d98494078..bf41b43ffa9a 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -137,6 +137,7 @@ struct i915_gem_context {
 #define UCONTEXT_NO_ERROR_CAPTURE	1
 #define UCONTEXT_BANNABLE		2
 #define UCONTEXT_RECOVERABLE		3
+#define UCONTEXT_PERSISTENCE		4
 
 	/**
 	 * @flags: small set of booleans
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
new file mode 100644
index 000000000000..0c0632a423a7
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -0,0 +1,53 @@
+/*
+ * SPDX-License-Identifier: MIT
+ *
+ * Copyright © 2019 Intel Corporation
+ */
+
+#include "i915_request.h"
+
+#include "intel_context.h"
+#include "intel_engine_heartbeat.h"
+#include "intel_engine_pm.h"
+#include "intel_engine.h"
+#include "intel_gt.h"
+
+void intel_engine_pulse(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce = engine->kernel_context;
+	struct i915_sched_attr attr = { .priority = INT_MAX };
+	struct i915_request *rq;
+	int err = 0;
+
+	GEM_BUG_ON(!engine->schedule);
+
+	if (!intel_engine_pm_get_if_awake(engine))
+		return;
+
+	mutex_lock(&ce->ring->timeline->mutex);
+
+	intel_context_enter(ce);
+	rq = __i915_request_create(ce, GFP_NOWAIT | __GFP_NOWARN);
+	intel_context_exit(ce);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_unlock;
+	}
+	i915_request_get(rq);
+
+	rq->flags |= I915_REQUEST_SENTINEL;
+	__i915_request_commit(rq);
+
+	local_bh_disable();
+	engine->schedule(rq, &attr);
+	local_bh_enable();
+
+	i915_request_put(rq);
+
+out_unlock:
+	mutex_unlock(&ce->ring->timeline->mutex);
+	intel_context_timeline_unlock(ce);
+	intel_engine_pm_put(engine);
+	if (err) /* XXX must not be allowed to fail */
+		DRM_ERROR("Failed to ping %s, err=%d\n", engine->name, err);
+}
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h
new file mode 100644
index 000000000000..86761748dc21
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h
@@ -0,0 +1,14 @@
+/*
+ * SPDX-License-Identifier: MIT
+ *
+ * Copyright © 2019 Intel Corporation
+ */
+
+#ifndef INTEL_ENGINE_HEARTBEAT_H
+#define INTEL_ENGINE_HEARTBEAT_H
+
+struct intel_engine_cs;
+
+void intel_engine_pulse(struct intel_engine_cs *engine);
+
+#endif /* INTEL_ENGINE_HEARTBEAT_H */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 469dc512cca3..dbc8691d75d0 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1565,6 +1565,21 @@ struct drm_i915_gem_context_param {
  *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
  */
 #define I915_CONTEXT_PARAM_ENGINES	0xa
+
+/*
+ * I915_CONTEXT_PARAM_PERSISTENCE:
+ *
+ * Allow the context and active rendering to survive the process until
+ * completion. Persistence allows fire-and-forget clients to queue up a
+ * bunch of work, hand the output over to a display server and then quit.
+ * If the context is not marked as persistent, upon closing (either via
+ * an explicit DRM_I915_GEM_CONTEXT_DESTROY or implicitly from file closure
+ * or process termination), the context and any outstanding requests will be
+ * cancelled (and exported fences for cancelled requests marked as -EIO).
+ *
+ * By default, new contexts allow persistence.
+ */
+#define I915_CONTEXT_PARAM_PERSISTENCE	0xb
 /* Must be kept compact -- no holes and well documented */
 
 	__u64 value;
-- 
2.23.0.rc1



* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-06 13:47 ` [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close Chris Wilson
@ 2019-08-06 14:11   ` Chris Wilson
  2019-08-06 14:26   ` Bloomfield, Jon
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: Chris Wilson @ 2019-08-06 14:11 UTC (permalink / raw)
  To: intel-gfx

Quoting Chris Wilson (2019-08-06 14:47:25)
> +static int
> +set_persistence(struct i915_gem_context *ctx,
> +               const struct drm_i915_gem_context_param *args)
> +{
> +       if (args->size)
> +               return -EINVAL;
> +
> +       if (!args->value) {
> +               i915_gem_context_clear_persistence(ctx);
> +               return 0;
> +       }
> +
> +       if (!(ctx->i915->caps.scheduler & I915_SCHEDULER_CAP_PREEMPTION))
> +               return -ENODEV;
> +
> +       i915_gem_context_set_persistence(ctx);
> +       return 0;
> +}

Oops, that is back-to-front. We can only switch off persistent behaviour if
we can do the preempt.
-Chris


* Re: [PATCH 1/5] drm/i915: Only enqueue already completed requests
  2019-08-06 13:47 [PATCH 1/5] drm/i915: Only enqueue already completed requests Chris Wilson
                   ` (3 preceding siblings ...)
  2019-08-06 13:47 ` [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close Chris Wilson
@ 2019-08-06 14:25 ` Mika Kuoppala
  2019-08-06 14:44   ` Chris Wilson
  2019-08-06 14:51 ` ✗ Fi.CI.BAT: failure for series starting with [1/5] " Patchwork
  5 siblings, 1 reply; 25+ messages in thread
From: Mika Kuoppala @ 2019-08-06 14:25 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> If we are asked to submit a completed request, just move it onto the
> active-list without modifying its payload.
>

If the seqno is there, then there is no need to
write it, yes.

But I am not at all certain that I deduced the 'why'
part correctly so please spell it out.

-Mika


> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_request.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 8ac7d14ec8c9..69fc66bd1125 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -397,6 +397,9 @@ void __i915_request_submit(struct i915_request *request)
>  	GEM_BUG_ON(!irqs_disabled());
>  	lockdep_assert_held(&engine->active.lock);
>  
> +	if (i915_request_completed(request))
> +		goto xfer;
> +
>  	if (i915_gem_context_is_banned(request->gem_context))
>  		i915_request_skip(request, -EIO);
>  
> @@ -420,7 +423,13 @@ void __i915_request_submit(struct i915_request *request)
>  	    i915_sw_fence_signaled(&request->semaphore))
>  		engine->saturated |= request->sched.semaphores;
>  
> +	engine->emit_fini_breadcrumb(request,
> +				     request->ring->vaddr + request->postfix);
> +
> +	engine->serial++;
> +
>  	/* We may be recursing from the signal callback of another i915 fence */
> +xfer:
>  	spin_lock_nested(&request->lock, SINGLE_DEPTH_NESTING);
>  
>  	list_move_tail(&request->sched.link, &engine->active.requests);
> @@ -437,11 +446,6 @@ void __i915_request_submit(struct i915_request *request)
>  
>  	spin_unlock(&request->lock);
>  
> -	engine->emit_fini_breadcrumb(request,
> -				     request->ring->vaddr + request->postfix);
> -
> -	engine->serial++;
> -
>  	trace_i915_request_execute(request);
>  }
>  
> -- 
> 2.23.0.rc1
>


* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-06 13:47 ` [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close Chris Wilson
  2019-08-06 14:11   ` Chris Wilson
@ 2019-08-06 14:26   ` Bloomfield, Jon
  2019-08-06 14:41     ` Chris Wilson
  2019-08-07 13:22   ` Chris Wilson
  2019-08-09 23:34   ` Chris Wilson
  3 siblings, 1 reply; 25+ messages in thread
From: Bloomfield, Jon @ 2019-08-06 14:26 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Tuesday, August 6, 2019 6:47 AM
> To: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>; Joonas Lahtinen
> <joonas.lahtinen@linux.intel.com>; Winiarski, Michal
> <michal.winiarski@intel.com>; Bloomfield, Jon <jon.bloomfield@intel.com>
> Subject: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> 
> Normally, we rely on our hangcheck to prevent persistent batches from
> hogging the GPU. However, if the user disables hangcheck, this mechanism
> breaks down. Despite our insistence that this is unsafe, the users are
> equally insistent that they want to use endless batches and will disable
> the hangcheck mechanism. We are looking at perhaps replacing hangcheck
> with a softer mechanism, that sends a pulse down the engine to check if
> it is well. We can use the same preemptive pulse to flush an active
> persistent context off the GPU upon context close, preventing resources
> being lost and unkillable requests remaining on the GPU, after process
> termination. To avoid changing the ABI and accidentally breaking
> existing userspace, we make the persistence of a context explicit and
> enable it by default. Userspace can opt out of persistent mode (forcing
> requests to be cancelled when the context is closed by process
> termination or explicitly) by a context parameter, or to facilitate
> existing use-cases by disabling hangcheck (i915.enable_hangcheck=0).
> (Note, one of the outcomes for supporting endless mode will be the
> removal of hangchecking, at which point opting into persistent mode will
> be mandatory, or maybe the default.)
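The opt-in/opt-out rules described above can be sketched as a tiny userspace model (plain C; the flag name mirrors the patch, but `ctx_model` and its helpers are illustrative stand-ins, not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the patch's UCONTEXT_PERSISTENCE user-flag bit (value illustrative). */
enum { UCONTEXT_PERSISTENCE = 4 };

struct ctx_model {
	unsigned long user_flags;
};

static bool is_persistent(const struct ctx_model *ctx)
{
	return ctx->user_flags & (1UL << UCONTEXT_PERSISTENCE);
}

/* __create_context(): new contexts default to persistent only while hangcheck is enabled. */
static void create_ctx(struct ctx_model *ctx, bool enable_hangcheck)
{
	ctx->user_flags = 0;
	if (enable_hangcheck)
		ctx->user_flags |= 1UL << UCONTEXT_PERSISTENCE;
}

/* set_persistence(): opting back in requires a preemption-capable scheduler. */
static int set_persistence(struct ctx_model *ctx, unsigned long long value,
			   bool has_preemption)
{
	if (!value) {
		ctx->user_flags &= ~(1UL << UCONTEXT_PERSISTENCE);
		return 0;
	}
	if (!has_preemption)
		return -1; /* stands in for -ENODEV */
	ctx->user_flags |= 1UL << UCONTEXT_PERSISTENCE;
	return 0;
}
```

Only the default at creation depends on i915.enable_hangcheck; an explicit setparam overrides it in either direction.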
> 
> Testcase: igt/gem_ctx_persistence
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Jon Bloomfield <jon.bloomfield@intel.com>
> 
> ---
> Same sort of caveats as for hangcheck, a few corner cases need
> struct_mutex and some preallocation.
> ---
>  drivers/gpu/drm/i915/Makefile                 |  3 +-
>  drivers/gpu/drm/i915/gem/i915_gem_context.c   | 79 +++++++++++++++++++
>  drivers/gpu/drm/i915/gem/i915_gem_context.h   | 15 ++++
>  .../gpu/drm/i915/gem/i915_gem_context_types.h |  1 +
>  .../gpu/drm/i915/gt/intel_engine_heartbeat.c  | 53 +++++++++++++
>  .../gpu/drm/i915/gt/intel_engine_heartbeat.h  | 14 ++++
>  include/uapi/drm/i915_drm.h                   | 15 ++++
>  7 files changed, 179 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>  create mode 100644 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h
> 
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index a1016858d014..42417d87496e 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -71,9 +71,10 @@ obj-y += gt/
>  gt-y += \
>  	gt/intel_breadcrumbs.o \
>  	gt/intel_context.o \
> -	gt/intel_engine_pool.o \
>  	gt/intel_engine_cs.o \
> +	gt/intel_engine_heartbeat.o \
>  	gt/intel_engine_pm.o \
> +	gt/intel_engine_pool.o \
>  	gt/intel_gt.o \
>  	gt/intel_gt_pm.o \
>  	gt/intel_hangcheck.o \
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index 64f7a533e886..5718b74f95b7 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -70,6 +70,7 @@
>  #include <drm/i915_drm.h>
> 
>  #include "gt/intel_lrc_reg.h"
> +#include "gt/intel_engine_heartbeat.h"
> 
>  #include "i915_gem_context.h"
>  #include "i915_globals.h"
> @@ -373,6 +374,45 @@ void i915_gem_context_release(struct kref *ref)
>  		queue_work(i915->wq, &i915->contexts.free_work);
>  }
> 
> +static void kill_context(struct i915_gem_context *ctx)
> +{
> +	struct i915_gem_engines_iter it;
> +	struct intel_engine_cs *engine;
> +	intel_engine_mask_t tmp, active;
> +	struct intel_context *ce;
> +
> +	if (i915_gem_context_is_banned(ctx))
> +		return;
> +
> +	i915_gem_context_set_banned(ctx);
> +
> +	active = 0;
> +	for_each_gem_engine(ce, i915_gem_context_lock_engines(ctx), it) {
> +		struct i915_request *rq;
> +
> +		if (!ce->ring)
> +			continue;
> +
> +		rq = i915_active_request_get_unlocked(&ce->ring->timeline->last_request);
> +		if (!rq)
> +			continue;
> +
> +		active |= rq->engine->mask;
> +		i915_request_put(rq);
> +	}
> +	i915_gem_context_unlock_engines(ctx);
> +
> +	/*
> +	 * Send a "high priority pulse" down the engine to cause the
> +	 * current request to be momentarily preempted. (If it fails to
> +	 * be preempted, it will be reset). As we have marked our context
> +	 * as banned, any incomplete request, including any running, will
> +	 * be skipped.
> +	 */
> +	for_each_engine_masked(engine, ctx->i915, active, tmp)
> +		intel_engine_pulse(engine);
> +}
> +
>  static void context_close(struct i915_gem_context *ctx)
>  {
>  	mutex_lock(&ctx->mutex);
> @@ -394,6 +434,15 @@ static void context_close(struct i915_gem_context *ctx)
>  	lut_close(ctx);
> 
>  	mutex_unlock(&ctx->mutex);
> +
> +	/*
> +	 * If the user has disabled hangchecking, we can not be sure that
> +	 * the batches will ever complete and let the context be freed.
> +	 * Forcibly kill off any remaining requests in this case.
> +	 */

Persistence is independent of hangcheck, which makes the above comment unrepresentative of the code below.
Do we even need a comment here? It looks self-explanatory. Instead, maybe comment the defaulting state in __create_context() to make the connection between hangcheck and persistence?
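Jon's point aside, the kill_context() flow quoted above reduces to: mark the context banned, gather the mask of engines with an outstanding request, then pulse only those engines. A compilable model of that accumulation (hypothetical types standing in for the i915 structs):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ENGINES 8

struct engine_model {
	unsigned int mask;	/* one bit per engine */
};

struct context_model {
	bool banned;
	/* engine of the last outstanding request per context engine, or NULL */
	const struct engine_model *last_request_engine[MAX_ENGINES];
	size_t nengine;
};

/* Returns the mask of engines that would be pulsed; bans the context first. */
static unsigned int kill_context_model(struct context_model *ctx)
{
	unsigned int active = 0;
	size_t i;

	if (ctx->banned)
		return 0;	/* already killed, nothing to do */
	ctx->banned = true;

	for (i = 0; i < ctx->nengine; i++) {
		const struct engine_model *e = ctx->last_request_engine[i];

		if (!e)
			continue;	/* no ring / no outstanding request */
		active |= e->mask;
	}

	/* the real code now runs intel_engine_pulse() for each set bit */
	return active;
}
```

Banning first is what lets the high-priority pulse skip every incomplete request belonging to the context once it preempts.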

> +	if (!i915_gem_context_is_persistent(ctx))
> +		kill_context(ctx);
> +
>  	i915_gem_context_put(ctx);
>  }
> 
> @@ -433,6 +482,8 @@ __create_context(struct drm_i915_private *i915)
> 
>  	i915_gem_context_set_bannable(ctx);
>  	i915_gem_context_set_recoverable(ctx);
> +	if (i915_modparams.enable_hangcheck)
> +		i915_gem_context_set_persistence(ctx);
> 
>  	ctx->ring_size = 4 * PAGE_SIZE;
> 
> @@ -1745,6 +1796,25 @@ get_engines(struct i915_gem_context *ctx,
>  	return err;
>  }
> 
> +static int
> +set_persistence(struct i915_gem_context *ctx,
> +		const struct drm_i915_gem_context_param *args)
> +{
> +	if (args->size)
> +		return -EINVAL;
> +
> +	if (!args->value) {
> +		i915_gem_context_clear_persistence(ctx);
> +		return 0;
> +	}
> +
> +	if (!(ctx->i915->caps.scheduler & I915_SCHEDULER_CAP_PREEMPTION))
> +		return -ENODEV;
> +
> +	i915_gem_context_set_persistence(ctx);
> +	return 0;
> +}
> +
>  static int ctx_setparam(struct drm_i915_file_private *fpriv,
>  			struct i915_gem_context *ctx,
>  			struct drm_i915_gem_context_param *args)
> @@ -1822,6 +1892,10 @@ static int ctx_setparam(struct drm_i915_file_private *fpriv,
>  		ret = set_engines(ctx, args);
>  		break;
> 
> +	case I915_CONTEXT_PARAM_PERSISTENCE:
> +		ret = set_persistence(ctx, args);
> +		break;
> +
>  	case I915_CONTEXT_PARAM_BAN_PERIOD:
>  	default:
>  		ret = -EINVAL;
> @@ -2278,6 +2352,11 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  		ret = get_engines(ctx, args);
>  		break;
> 
> +	case I915_CONTEXT_PARAM_PERSISTENCE:
> +		args->size = 0;
> +		args->value = i915_gem_context_is_persistent(ctx);
> +		break;
> +
>  	case I915_CONTEXT_PARAM_BAN_PERIOD:
>  	default:
>  		ret = -EINVAL;
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.h b/drivers/gpu/drm/i915/gem/i915_gem_context.h
> index 106e2ccf7a4c..1834d1df2ac9 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.h
> @@ -74,6 +74,21 @@ static inline void i915_gem_context_clear_recoverable(struct i915_gem_context *c
>  	clear_bit(UCONTEXT_RECOVERABLE, &ctx->user_flags);
>  }
> 
> +static inline bool i915_gem_context_is_persistent(const struct i915_gem_context *ctx)
> +{
> +	return test_bit(UCONTEXT_PERSISTENCE, &ctx->user_flags);
> +}
> +
> +static inline void i915_gem_context_set_persistence(struct i915_gem_context *ctx)
> +{
> +	set_bit(UCONTEXT_PERSISTENCE, &ctx->user_flags);
> +}
> +
> +static inline void i915_gem_context_clear_persistence(struct i915_gem_context *ctx)
> +{
> +	clear_bit(UCONTEXT_PERSISTENCE, &ctx->user_flags);
> +}
> +
>  static inline bool i915_gem_context_is_banned(const struct i915_gem_context *ctx)
>  {
>  	return test_bit(CONTEXT_BANNED, &ctx->flags);
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> index a02d98494078..bf41b43ffa9a 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> @@ -137,6 +137,7 @@ struct i915_gem_context {
>  #define UCONTEXT_NO_ERROR_CAPTURE	1
>  #define UCONTEXT_BANNABLE		2
>  #define UCONTEXT_RECOVERABLE		3
> +#define UCONTEXT_PERSISTENCE		4
> 
>  	/**
>  	 * @flags: small set of booleans
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> new file mode 100644
> index 000000000000..0c0632a423a7
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> @@ -0,0 +1,53 @@
> +/*
> + * SPDX-License-Identifier: MIT
> + *
> + * Copyright © 2019 Intel Corporation
> + */
> +
> +#include "i915_request.h"
> +
> +#include "intel_context.h"
> +#include "intel_engine_heartbeat.h"
> +#include "intel_engine_pm.h"
> +#include "intel_engine.h"
> +#include "intel_gt.h"
> +
> +void intel_engine_pulse(struct intel_engine_cs *engine)
> +{
> +	struct intel_context *ce = engine->kernel_context;
> +	struct i915_sched_attr attr = { .priority = INT_MAX };
> +	struct i915_request *rq;
> +	int err = 0;
> +
> +	GEM_BUG_ON(!engine->schedule);
> +
> +	if (!intel_engine_pm_get_if_awake(engine))
> +		return;
> +
> +	mutex_lock(&ce->ring->timeline->mutex);
> +
> +	intel_context_enter(ce);
> +	rq = __i915_request_create(ce, GFP_NOWAIT | __GFP_NOWARN);
> +	intel_context_exit(ce);
> +	if (IS_ERR(rq)) {
> +		err = PTR_ERR(rq);
> +		goto out_unlock;
> +	}
> +	i915_request_get(rq);
> +
> +	rq->flags |= I915_REQUEST_SENTINEL;
> +	__i915_request_commit(rq);
> +
> +	local_bh_disable();
> +	engine->schedule(rq, &attr);
> +	local_bh_enable();
> +
> +	i915_request_put(rq);
> +
> +out_unlock:
> +	mutex_unlock(&ce->ring->timeline->mutex);
> +	intel_context_timeline_unlock(ce);
> +	intel_engine_pm_put(engine);
> +	if (err) /* XXX must not be allowed to fail */
> +		DRM_ERROR("Failed to ping %s, err=%d\n", engine->name, err);
> +}
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h
> new file mode 100644
> index 000000000000..86761748dc21
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h
> @@ -0,0 +1,14 @@
> +/*
> + * SPDX-License-Identifier: MIT
> + *
> + * Copyright © 2019 Intel Corporation
> + */
> +
> +#ifndef INTEL_ENGINE_HEARTBEAT_H
> +#define INTEL_ENGINE_HEARTBEAT_H
> +
> +struct intel_engine_cs;
> +
> +void intel_engine_pulse(struct intel_engine_cs *engine);
> +
> +#endif /* INTEL_ENGINE_HEARTBEAT_H */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 469dc512cca3..dbc8691d75d0 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1565,6 +1565,21 @@ struct drm_i915_gem_context_param {
>   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
>   */
>  #define I915_CONTEXT_PARAM_ENGINES	0xa
> +
> +/*
> + * I915_CONTEXT_PARAM_PERSISTENCE:
> + *
> + * Allow the context and active rendering to survive the process until
> + * completion. Persistence allows fire-and-forget clients to queue up a
> + * bunch of work, hand the output over to a display server and then quit.
> + * If the context is not marked as persistent, upon closing (either via
> + * an explicit DRM_I915_GEM_CONTEXT_DESTROY or implicitly from file closure
> + * or process termination), the context and any outstanding requests will be
> + * cancelled (and exported fences for cancelled requests marked as -EIO).
> + *
> + * By default, new contexts allow persistence.
> + */
> +#define I915_CONTEXT_PARAM_PERSISTENCE	0xb
>  /* Must be kept compact -- no holes and well documented */
> 
>  	__u64 value;
> --
> 2.23.0.rc1


* Re: [PATCH 3/5] drm/i915: Mark up "sentinel" requests
  2019-08-06 13:47 ` [PATCH 3/5] drm/i915: Mark up "sentinel" requests Chris Wilson
@ 2019-08-06 14:29   ` Mika Kuoppala
  2019-08-06 14:47     ` Chris Wilson
  0 siblings, 1 reply; 25+ messages in thread
From: Mika Kuoppala @ 2019-08-06 14:29 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Sometimes we want to emit a terminator request, a request that flushes
> the pipeline and allows no request to come after it. This can be used
> for a "preempt-to-idle" to ensure that upon processing the
> context-switch to that request, all other active contexts have been
> flushed.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/gt/intel_lrc.c |  6 ++++++
>  drivers/gpu/drm/i915/i915_request.h | 10 ++++++++--
>  2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index 4e45cd972267..59a7e4eb7e2a 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -759,6 +759,9 @@ static bool can_merge_rq(const struct i915_request *prev,
>  	GEM_BUG_ON(prev == next);
>  	GEM_BUG_ON(!assert_priority_queue(prev, next));
>  
> +	if (i915_request_has_sentinel(prev))
> +		return false;
> +
>  	if (!can_merge_ctx(prev->hw_context, next->hw_context))
>  		return false;
>  
> @@ -1250,6 +1253,9 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>  				if (last->hw_context == rq->hw_context)
>  					goto done;
>  
> +				if (i915_request_has_sentinel(last))
> +					goto done;
> +
>  				/*
>  				 * If GVT overrides us we only ever submit
>  				 * port[0], leaving port[1] empty. Note that we
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 313df3c37158..c41ecbe0bd0c 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -217,8 +217,9 @@ struct i915_request {
>  	unsigned long emitted_jiffies;
>  
>  	unsigned long flags;
> -#define I915_REQUEST_WAITBOOST BIT(0)
> -#define I915_REQUEST_NOPREEMPT BIT(1)
> +#define I915_REQUEST_WAITBOOST	BIT(0)
> +#define I915_REQUEST_NOPREEMPT	BIT(1)
> +#define I915_REQUEST_SENTINEL	BIT(2)

Would it be possible to use 'empty', i.e. non-payloadable,
requests as a sentinel? (using the request->postfix).

Or is the advantage here that by attaching it as
a property, you avoid submitting an extra (empty) request?

-Mika

>  
>  	/** timeline->request entry for this request */
>  	struct list_head link;
> @@ -443,6 +444,11 @@ static inline bool i915_request_has_nopreempt(const struct i915_request *rq)
>  	return unlikely(rq->flags & I915_REQUEST_NOPREEMPT);
>  }
>  
> +static inline bool i915_request_has_sentinel(const struct i915_request *rq)
> +{
> +	return unlikely(rq->flags & I915_REQUEST_SENTINEL);
> +}
> +
>  bool i915_retire_requests(struct drm_i915_private *i915);
>  
>  #endif /* I915_REQUEST_H */
> -- 
> 2.23.0.rc1
>

* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-06 14:26   ` Bloomfield, Jon
@ 2019-08-06 14:41     ` Chris Wilson
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Wilson @ 2019-08-06 14:41 UTC (permalink / raw)
  To: Bloomfield, Jon, intel-gfx

Quoting Bloomfield, Jon (2019-08-06 15:26:07)
> > -----Original Message-----
> > From: Chris Wilson <chris@chris-wilson.co.uk>
> > +
> > +     /*
> > +      * If the user has disabled hangchecking, we can not be sure that
> > +      * the batches will ever complete and let the context be freed.
> > +      * Forcibly kill off any remaining requests in this case.
> > +      */
> 
> Persistence is independent of hangcheck which makes the above comment unrepresentative of the below code.
> Do we even need a comment here, it looks self-explanatory? Instead maybe comment the defaulting state in __createContext() to make the connection between hangcheck and persistence?

Indeed, it can be moved almost verbatim to explain the default behaviour
and expand upon the cgroup wish. (Rather than using hangcheck modparam,
let the sysadmin specify defaults for groups of processes -- planning
for when we remove hangcheck entirely.)
-Chris

* Re: [PATCH 1/5] drm/i915: Only enqueue already completed requests
  2019-08-06 14:25 ` [PATCH 1/5] drm/i915: Only enqueue already completed requests Mika Kuoppala
@ 2019-08-06 14:44   ` Chris Wilson
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Wilson @ 2019-08-06 14:44 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2019-08-06 15:25:49)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > If we are asked to submit a completed request, just move it onto the
> > active-list without modifying its payload.
> >
> 
> If the seqno is already there, then there is no need to write
> it again, yes.
> 
> But I am not at all certain that I deduced the 'why'
> part correctly so please spell it out.

Because if the request is complete, we may have moved the ring->head
during retirement (or in the new reset-on-completion) past the breadcrumb
and would generate warnings for appearing to go backwards.
-Chris
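The hazard Chris describes can be modelled in a few lines (illustrative types, not the i915 ones): once retirement has advanced ring->head past a completed request's postfix, re-emitting its breadcrumb would write behind head, which is the "going backwards" case the patch avoids by skipping straight to the active-list transfer:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the ring/request bookkeeping. */
struct ring_model {
	unsigned int head;	/* retirement has consumed the ring up to here */
};

struct request_model {
	unsigned int postfix;	/* where the fini breadcrumb would be written */
	bool completed;
};

/* A breadcrumb written behind head is the "appears to go backwards" case. */
static bool write_goes_backwards(const struct ring_model *ring,
				 const struct request_model *rq)
{
	return rq->postfix < ring->head;
}

/* After the patch: only incomplete requests emit a fini breadcrumb. */
static bool submit_emits_breadcrumb(const struct request_model *rq)
{
	return !rq->completed;
}
```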

* Re: [PATCH 3/5] drm/i915: Mark up "sentinel" requests
  2019-08-06 14:29   ` Mika Kuoppala
@ 2019-08-06 14:47     ` Chris Wilson
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Wilson @ 2019-08-06 14:47 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2019-08-06 15:29:58)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Sometimes we want to emit a terminator request, a request that flushes
> > the pipeline and allows no request to come after it. This can be used
> > for a "preempt-to-idle" to ensure that upon processing the
> > context-switch to that request, all other active contexts have been
> > flushed.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_lrc.c |  6 ++++++
> >  drivers/gpu/drm/i915/i915_request.h | 10 ++++++++--
> >  2 files changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > index 4e45cd972267..59a7e4eb7e2a 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > @@ -759,6 +759,9 @@ static bool can_merge_rq(const struct i915_request *prev,
> >       GEM_BUG_ON(prev == next);
> >       GEM_BUG_ON(!assert_priority_queue(prev, next));
> >  
> > +     if (i915_request_has_sentinel(prev))
> > +             return false;
> > +
> >       if (!can_merge_ctx(prev->hw_context, next->hw_context))
> >               return false;
> >  
> > @@ -1250,6 +1253,9 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
> >                               if (last->hw_context == rq->hw_context)
> >                                       goto done;
> >  
> > +                             if (i915_request_has_sentinel(last))
> > +                                     goto done;
> > +
> >                               /*
> >                                * If GVT overrides us we only ever submit
> >                                * port[0], leaving port[1] empty. Note that we
> > diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> > index 313df3c37158..c41ecbe0bd0c 100644
> > --- a/drivers/gpu/drm/i915/i915_request.h
> > +++ b/drivers/gpu/drm/i915/i915_request.h
> > @@ -217,8 +217,9 @@ struct i915_request {
> >       unsigned long emitted_jiffies;
> >  
> >       unsigned long flags;
> > -#define I915_REQUEST_WAITBOOST BIT(0)
> > -#define I915_REQUEST_NOPREEMPT BIT(1)
> > +#define I915_REQUEST_WAITBOOST       BIT(0)
> > +#define I915_REQUEST_NOPREEMPT       BIT(1)
> > +#define I915_REQUEST_SENTINEL        BIT(2)
> 
> Would it be possible to use 'empty', i.e. non-payloadable,
> requests as a sentinel? (using the request->postfix).
> 
> Or is the advantage here that by attaching it as
> a property, you avoid submitting an extra (empty) request?

To achieve the effect I want (effectively preempt-to-idle!), I need to
fill the ELSP with dummies (N different contexts) to prevent submitting
real requests after the preemption heartbeat. So I want an end-of-thread
marker.

I went through
	I915_REQUEST_EOT
	I915_REQUEST_TERMINATOR
	I915_REQUEST_SENTINEL
-Chris
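What the sentinel flag buys in can_merge_rq()/execlists_dequeue() is that coalescing of consecutive requests into a single ELSP submission stops at the first sentinel, leaving it as the end-of-thread marker. A standalone sketch (simplified types; only the context-match and sentinel rules from the patch are modelled):

```c
#include <assert.h>
#include <stddef.h>

#define REQ_SENTINEL (1UL << 2)	/* mirrors I915_REQUEST_SENTINEL */

struct req_model {
	unsigned long flags;
	int ctx_id;		/* stands in for rq->hw_context */
};

/* can_merge_rq(): nothing may follow a sentinel in the same submission. */
static int can_merge(const struct req_model *prev, const struct req_model *next)
{
	if (prev->flags & REQ_SENTINEL)
		return 0;
	return prev->ctx_id == next->ctx_id;
}

/* How many queued requests coalesce into the first ELSP submission. */
static size_t coalesce(const struct req_model *q, size_t n)
{
	size_t i;

	if (n == 0)
		return 0;
	for (i = 1; i < n; i++)
		if (!can_merge(&q[i - 1], &q[i]))
			break;
	return i;
}
```

A sentinel may itself be merged after earlier requests of the same context, but nothing merges after it, which is exactly the terminator behaviour described above.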

* ✗ Fi.CI.BAT: failure for series starting with [1/5] drm/i915: Only enqueue already completed requests
  2019-08-06 13:47 [PATCH 1/5] drm/i915: Only enqueue already completed requests Chris Wilson
                   ` (4 preceding siblings ...)
  2019-08-06 14:25 ` [PATCH 1/5] drm/i915: Only enqueue already completed requests Mika Kuoppala
@ 2019-08-06 14:51 ` Patchwork
  5 siblings, 0 replies; 25+ messages in thread
From: Patchwork @ 2019-08-06 14:51 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/5] drm/i915: Only enqueue already completed requests
URL   : https://patchwork.freedesktop.org/series/64781/
State : failure

== Summary ==

Applying: drm/i915: Only enqueue already completed requests
Applying: drm/i915/execlists: Force preemption
Applying: drm/i915: Mark up "sentinel" requests
Applying: drm/i915/execlists: Cancel banned contexts on schedule-out
Applying: drm/i915: Cancel non-persistent contexts on close
Using index info to reconstruct a base tree...
M	drivers/gpu/drm/i915/Makefile
M	drivers/gpu/drm/i915/gem/i915_gem_context.c
Falling back to patching base and 3-way merge...
Auto-merging drivers/gpu/drm/i915/gem/i915_gem_context.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/gem/i915_gem_context.c
Auto-merging drivers/gpu/drm/i915/Makefile
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/Makefile
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch' to see the failed patch
Patch failed at 0005 drm/i915: Cancel non-persistent contexts on close
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-06 13:47 ` [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close Chris Wilson
  2019-08-06 14:11   ` Chris Wilson
  2019-08-06 14:26   ` Bloomfield, Jon
@ 2019-08-07 13:22   ` Chris Wilson
  2019-08-07 14:04     ` Bloomfield, Jon
  2019-08-09 23:34   ` Chris Wilson
  3 siblings, 1 reply; 25+ messages in thread
From: Chris Wilson @ 2019-08-07 13:22 UTC (permalink / raw)
  To: intel-gfx

Quoting Chris Wilson (2019-08-06 14:47:25)
> @@ -433,6 +482,8 @@ __create_context(struct drm_i915_private *i915)
>  
>         i915_gem_context_set_bannable(ctx);
>         i915_gem_context_set_recoverable(ctx);
> +       if (i915_modparams.enable_hangcheck)
> +               i915_gem_context_set_persistence(ctx);

I am not fond of this, but from a pragmatic point of view, this does
prevent the issue Jon raised: HW resources being pinned indefinitely
past process termination.

I don't like it because we cannot perform the operation cleanly
everywhere, it requires preemption first and foremost (with a cooperating
submission backend) and per-engine reset. The alternative is to try and
do a full GPU reset if the context is still active. For the sake of the
issue raised, I think that (full reset on older HW) is required.

That we are baking in a change of ABI due to an unsafe modparam is meh.
There are a few more corner cases to deal with before endless just
works. But since it is being used in the wild, I'm not sure we can wait
for userspace to opt-in, or wait for cgroups. However, since users are
being encouraged to disable hangcheck, should we extend the concept of
persistence to also mean "survives hangcheck"? -- though it should be a
separate parameter, and I'm not sure how exactly to protect it from the
heartbeat reset without giving gross privileges to the context. (CPU
isolation is nicer from the pov where we can just give up and not even
worry about the engine. If userspace can request isolation, it has the
privilege to screw up.)
-Chris

* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-07 13:22   ` Chris Wilson
@ 2019-08-07 14:04     ` Bloomfield, Jon
  2019-08-07 14:14       ` Chris Wilson
  0 siblings, 1 reply; 25+ messages in thread
From: Bloomfield, Jon @ 2019-08-07 14:04 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Wednesday, August 7, 2019 6:23 AM
> To: intel-gfx@lists.freedesktop.org
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>; Winiarski, Michal
> <michal.winiarski@intel.com>; Bloomfield, Jon <jon.bloomfield@intel.com>
> Subject: Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> 
> Quoting Chris Wilson (2019-08-06 14:47:25)
> > @@ -433,6 +482,8 @@ __create_context(struct drm_i915_private *i915)
> >
> >         i915_gem_context_set_bannable(ctx);
> >         i915_gem_context_set_recoverable(ctx);
> > +       if (i915_modparams.enable_hangcheck)
> > +               i915_gem_context_set_persistence(ctx);
> 
> I am not fond of this, but from a pragmatic point of view, this does
> prevent the issue Jon raised: HW resources being pinned indefinitely
> past process termination.
> 
> I don't like it because we cannot perform the operation cleanly
> everywhere, it requires preemption first and foremost (with a cooperating
> submission backend) and per-engine reset. The alternative is to try and
> do a full GPU reset if the context is still active. For the sake of the
> issue raised, I think that (full reset on older HW) is required.
> 
> That we are baking in a change of ABI due to an unsafe modparam is meh.
> There are a few more corner cases to deal with before endless just
> works. But since it is being used in the wild, I'm not sure we can wait
> for userspace to opt-in, or wait for cgroups. However, since users are
> being encouraged to disable hangcheck, should we extend the concept of
> persistence to also mean "survives hangcheck"? -- though it should be a
> separate parameter, and I'm not sure how exactly to protect it from the
> heartbeat reset without giving gross privileges to the context. (CPU
> isolation is nicer from the pov where we can just give up and not even
> worry about the engine. If userspace can request isolation, it has the
> privilege to screw up.)
> -Chris

Ok, so your concern is supporting non-persistence on older hardware that lacks preemption and engine reset. Why is that strictly required? Can't we simply make it dependent on the features needed to do it well, and if your hardware cannot, then the advice is not to disable hangcheck? I'm doubtful that anyone would attempt an HPC-type workload on an IVB.

I'm not sure I understand your "survives hangcheck" idea. You mean instead of simply disabling hangcheck, just opt in to persistence and have that also prevent hangcheck? Isn't that the wrong way around, since persistence is what is happening today?


* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-07 14:04     ` Bloomfield, Jon
@ 2019-08-07 14:14       ` Chris Wilson
  2019-08-07 14:33         ` Bloomfield, Jon
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Wilson @ 2019-08-07 14:14 UTC (permalink / raw)
  To: Bloomfield, Jon, intel-gfx

Quoting Bloomfield, Jon (2019-08-07 15:04:16)
> > -----Original Message-----
> > From: Chris Wilson <chris@chris-wilson.co.uk>
> > Sent: Wednesday, August 7, 2019 6:23 AM
> > To: intel-gfx@lists.freedesktop.org
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>; Winiarski, Michal
> > <michal.winiarski@intel.com>; Bloomfield, Jon <jon.bloomfield@intel.com>
> > Subject: Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> > 
> > Quoting Chris Wilson (2019-08-06 14:47:25)
> > > @@ -433,6 +482,8 @@ __create_context(struct drm_i915_private *i915)
> > >
> > >         i915_gem_context_set_bannable(ctx);
> > >         i915_gem_context_set_recoverable(ctx);
> > > +       if (i915_modparams.enable_hangcheck)
> > > +               i915_gem_context_set_persistence(ctx);
> > 
> > I am not fond of this, but from a pragmatic point of view, this does
> > prevent the issue Jon raised: HW resources being pinned indefinitely
> > past process termination.
> > 
> > I don't like it because we cannot perform the operation cleanly
> > everywhere, it requires preemption first and foremost (with a cooperating
> > submission backend) and per-engine reset. The alternative is to try and
> > do a full GPU reset if the context is still active. For the sake of the
> > issue raised, I think that (full reset on older HW) is required.
> > 
> > That we are baking in a change of ABI due to an unsafe modparam is meh.
> > There are a few more corner cases to deal with before endless just
> > works. But since it is being used in the wild, I'm not sure we can wait
> > for userspace to opt-in, or wait for cgroups. However, since users are
> > being encouraged to disable hangcheck, should we extend the concept of
> > persistence to also mean "survives hangcheck"? -- though it should be a
> > separate parameter, and I'm not sure how exactly to protect it from the
> > heartbeat reset without giving gross privileges to the context. (CPU
> > isolation is nicer from the pov where we can just give up and not even
> > worry about the engine. If userspace can request isolation, it has the
> > privilege to screw up.)
> 
> > Ok, so your concern is supporting non-persistence on older hardware that lacks preemption and engine reset. Why is that strictly required? Can't we simply

Our advice was not to disable hangcheck :)

With the cat out of the bag, my concern is dotting the Is and crossing
the Ts. Fixing up the error handling path to the reset isn't all that
bad. But I'm not going to advertise the persistence-parameter support
unless we have a clean solution, and we can advise that compute
workloads are better handled with new hardware.
 
> I'm not sure I understand your "survives hangcheck" idea. You mean instead of simply disabling hangcheck, just opt in to persistence and have that also prevent hangcheck? Isn't that the wrong way around, since persistence is what is happening today?

Persistence is the clear-and-present danger. I'm just trying to sketch a
path for endless support, trying to ask myself questions such as: Is the
persistence parameter still required? What other parameters make sense?
Does anything less than CPU-esque isolation make sense? :)
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-07 14:14       ` Chris Wilson
@ 2019-08-07 14:33         ` Bloomfield, Jon
  2019-08-07 15:08           ` Chris Wilson
  0 siblings, 1 reply; 25+ messages in thread
From: Bloomfield, Jon @ 2019-08-07 14:33 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Wednesday, August 7, 2019 7:14 AM
> To: Bloomfield, Jon <jon.bloomfield@intel.com>; intel-
> gfx@lists.freedesktop.org
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>; Winiarski, Michal
> <michal.winiarski@intel.com>
> Subject: RE: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> 
> Quoting Bloomfield, Jon (2019-08-07 15:04:16)
> > Ok, so your concern is supporting non-persistence on older hardware without
> preemption and engine-reset capability. Why is that strictly required? Can't we simply
> make it dependent on the features needed to do it well, and if your hardware
> cannot, then the advice is not to disable hangcheck? I'm doubtful that anyone
> would attempt a HPC type workload on an IVB.
> 
> Our advice was not to disable hangcheck :)
> 
> With the cat out of the bag, my concern is dotting the Is and crossing
> the Ts. Fixing up the error handling path to the reset isn't all that
> bad. But I'm not going to advertise the persistence-parameter support
> unless we have a clean solution, and we can advise that compute
> workloads are better handled with new hardware.
> 
> > I'm not sure I understand your "survives hangcheck" idea. You mean instead
> of simply disabling hangcheck, just opt in to persistence and have that also
> prevent hangcheck? Isn't that the wrong way around, since persistence is what
> is happening today?
> 
> Persistence is the clear-and-present danger. I'm just trying to sketch a
> path for endless support, trying to ask myself questions such as: Is the
> persistence parameter still required? What other parameters make sense?
> Does anything less than CPU-esque isolation make sense? :)
> -Chris

I personally liked your persistence idea :-)

Isolation doesn't really solve the problem in this case. So, a customer enables isolation for a HPC workload. That workload hangs, and the user hits ctrl-C. We still have a hung workload and the next job in the queue still can't run.

Also, isolation is kind of meaningless when there is only one engine that's capable of running your workload. On Gen9, pretty much every type of workload requires some RCS involvement, and RCS is where the compute workloads need to run. So isolation hasn't helped us.

I'd settle for the UMD opting in to non-persistence and not providing the automatic association with hangcheck. That at least ensures well-behaved UMDs don't kill the system.

We didn't explore the idea of terminating orphaned contexts though (where none of their resources are referenced by any other contexts). Is there a reason why this is not feasible? In the case of compute (certainly HPC) workloads, there would be no compositor taking the output so this might be a solution.

Jon

* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-07 14:33         ` Bloomfield, Jon
@ 2019-08-07 15:08           ` Chris Wilson
  2019-08-07 15:29             ` Bloomfield, Jon
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Wilson @ 2019-08-07 15:08 UTC (permalink / raw)
  To: Bloomfield, Jon, intel-gfx

Quoting Bloomfield, Jon (2019-08-07 15:33:51)
[skip to end]
> We didn't explore the idea of terminating orphaned contexts though (where none of their resources are referenced by any other contexts). Is there a reason why this is not feasible? In the case of compute (certainly HPC) workloads, there would be no compositor taking the output so this might be a solution.

Sounds easier said than done. We have to go through each request and
determine if it has an external reference (or if the object holding the
reference has an external reference) to see if the output would be
visible to a third party. Sounds like a conservative GC :| 
(Coming to that conclusion suggests that we should structure the request
tracking to make reparenting easier.)

We could take a pid-1 approach and move all the orphan timelines over to
a new parent purely responsible for them. That honestly doesn't seem to
achieve anything. (We are still stuck with tasks on the GPU and no way
to kill them.)

In comparison, persistence is a rarely used "feature" and cleaning up on
context close fits in nicely with the process model. It just works as
most users/clients would expect. (Although running in non-persistent
by default hasn't shown anything to explode on the desktop, it's too easy
to construct scenarios where persistence turns out to be an advantage,
particularly with chains of clients (the compositor model).) Between the
two modes, we should have most bases covered, it's hard to argue for a
third way (that is until someone has a usecase!)
-Chris

* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-07 15:08           ` Chris Wilson
@ 2019-08-07 15:29             ` Bloomfield, Jon
  2019-08-07 15:38               ` Chris Wilson
  2019-08-07 16:51               ` Chris Wilson
  0 siblings, 2 replies; 25+ messages in thread
From: Bloomfield, Jon @ 2019-08-07 15:29 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Wednesday, August 7, 2019 8:08 AM
> To: Bloomfield, Jon <jon.bloomfield@intel.com>; intel-
> gfx@lists.freedesktop.org
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>; Winiarski, Michal
> <michal.winiarski@intel.com>
> Subject: RE: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> 
> Quoting Bloomfield, Jon (2019-08-07 15:33:51)
> [skip to end]
> > We didn't explore the idea of terminating orphaned contexts though
> (where none of their resources are referenced by any other contexts). Is
> there a reason why this is not feasible? In the case of compute (certainly
> HPC) workloads, there would be no compositor taking the output so this
> might be a solution.
> 
> Sounds easier said than done. We have to go through each request and
> determine if it has an external reference (or if the object holding the
> reference has an external reference) to see if the output would be
> visible to a third party. Sounds like a conservative GC :|
> (Coming to that conclusion suggests that we should structure the request
> tracking to make reparenting easier.)
> 
> We could take a pid-1 approach and move all the orphan timelines over to
> a new parent purely responsible for them. That honestly doesn't seem to
> achieve anything. (We are still stuck with tasks on the GPU and no way
> to kill them.)
> 
> In comparison, persistence is a rarely used "feature" and cleaning up on
> context close fits in nicely with the process model. It just works as
> most users/clients would expect. (Although running in non-persistent
> by default hasn't shown anything to explode on the desktop, it's too easy
> to construct scenarios where persistence turns out to be an advantage,
> particularly with chains of clients (the compositor model).) Between the
> two modes, we should have most bases covered, it's hard to argue for a
> third way (that is until someone has a usecase!)
> -Chris

Ok, makes sense. Thanks.

But have we converged on a decision :-)

As I said, requiring the compute UMD to opt in should be ok for the immediate HPC issue, but I'd personally argue that it's valid to change the contract for hangcheck=0 and switch the default to non-persistent.

Jon

* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-07 15:29             ` Bloomfield, Jon
@ 2019-08-07 15:38               ` Chris Wilson
  2019-08-07 16:51               ` Chris Wilson
  1 sibling, 0 replies; 25+ messages in thread
From: Chris Wilson @ 2019-08-07 15:38 UTC (permalink / raw)
  To: Bloomfield, Jon, intel-gfx

Quoting Bloomfield, Jon (2019-08-07 16:29:55)
> > -----Original Message-----
> > From: Chris Wilson <chris@chris-wilson.co.uk>
> > Sent: Wednesday, August 7, 2019 8:08 AM
> > To: Bloomfield, Jon <jon.bloomfield@intel.com>; intel-
> > gfx@lists.freedesktop.org
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>; Winiarski, Michal
> > <michal.winiarski@intel.com>
> > Subject: RE: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> > 
> > Quoting Bloomfield, Jon (2019-08-07 15:33:51)
> > [skip to end]
> > > We didn't explore the idea of terminating orphaned contexts though
> > (where none of their resources are referenced by any other contexts). Is
> > there a reason why this is not feasible? In the case of compute (certainly
> > HPC) workloads, there would be no compositor taking the output so this
> > might be a solution.
> > 
> > Sounds easier said than done. We have to go through each request and
> > determine if it has an external reference (or if the object holding the
> > reference has an external reference) to see if the output would be
> > visible to a third party. Sounds like a conservative GC :|
> > (Coming to that conclusion suggests that we should structure the request
> > tracking to make reparenting easier.)
> > 
> > We could take a pid-1 approach and move all the orphan timelines over to
> > a new parent purely responsible for them. That honestly doesn't seem to
> > achieve anything. (We are still stuck with tasks on the GPU and no way
> > to kill them.)
> > 
> > In comparison, persistence is a rarely used "feature" and cleaning up on
> > context close fits in nicely with the process model. It just works as
> > most users/clients would expect. (Although running in non-persistent
> > by default hasn't shown anything to explode on the desktop, it's too easy
> > to construct scenarios where persistence turns out to be an advantage,
> > particularly with chains of clients (the compositor model).) Between the
> > two modes, we should have most bases covered, it's hard to argue for a
> > third way (that is until someone has a usecase!)
> > -Chris
> 
> Ok, makes sense. Thanks.
> 
> But have we converged on a decision :-)
> 
> As I said, requiring compute umd optin should be ok for the immediate HPC issue, but I'd personally argue that it's valid to change the contract for hangcheck=0 and switch the default to non-persistent.

I don't have to like it, but I think that's what we have to do for the
interim 10 years or so.
-Chris

* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-07 15:29             ` Bloomfield, Jon
  2019-08-07 15:38               ` Chris Wilson
@ 2019-08-07 16:51               ` Chris Wilson
  2019-08-07 17:12                 ` Bloomfield, Jon
  1 sibling, 1 reply; 25+ messages in thread
From: Chris Wilson @ 2019-08-07 16:51 UTC (permalink / raw)
  To: Bloomfield, Jon, intel-gfx

Quoting Bloomfield, Jon (2019-08-07 16:29:55)
> > -----Original Message-----
> > From: Chris Wilson <chris@chris-wilson.co.uk>
> > Sent: Wednesday, August 7, 2019 8:08 AM
> > To: Bloomfield, Jon <jon.bloomfield@intel.com>; intel-
> > gfx@lists.freedesktop.org
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>; Winiarski, Michal
> > <michal.winiarski@intel.com>
> > Subject: RE: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> > 
> > Quoting Bloomfield, Jon (2019-08-07 15:33:51)
> > [skip to end]
> > > We didn't explore the idea of terminating orphaned contexts though
> > (where none of their resources are referenced by any other contexts). Is
> > there a reason why this is not feasible? In the case of compute (certainly
> > HPC) workloads, there would be no compositor taking the output so this
> > might be a solution.
> > 
> > Sounds easier said than done. We have to go through each request and
> > determine if it has an external reference (or if the object holding the
> > reference has an external reference) to see if the output would be
> > visible to a third party. Sounds like a conservative GC :|
> > (Coming to that conclusion suggests that we should structure the request
> > tracking to make reparenting easier.)
> > 
> > We could take a pid-1 approach and move all the orphan timelines over to
> > a new parent purely responsible for them. That honestly doesn't seem to
> > achieve anything. (We are still stuck with tasks on the GPU and no way
> > to kill them.)
> > 
> > In comparison, persistence is a rarely used "feature" and cleaning up on
> > context close fits in nicely with the process model. It just works as
> > most users/clients would expect. (Although running in non-persistent
> > by default hasn't shown anything to explode on the desktop, it's too easy
> > to construct scenarios where persistence turns out to be an advantage,
> > particularly with chains of clients (the compositor model).) Between the
> > two modes, we should have most bases covered, it's hard to argue for a
> > third way (that is until someone has a usecase!)
> > -Chris
> 
> Ok, makes sense. Thanks.
> 
> But have we converged on a decision :-)
> 
> As I said, requiring compute umd optin should be ok for the immediate HPC issue, but I'd personally argue that it's valid to change the contract for hangcheck=0 and switch the default to non-persistent.

Could you tender

diff --git a/runtime/os_interface/linux/drm_neo.cpp b/runtime/os_interface/linux/drm_neo.cpp
index 31deb68b..8a9af363 100644
--- a/runtime/os_interface/linux/drm_neo.cpp
+++ b/runtime/os_interface/linux/drm_neo.cpp
@@ -141,11 +141,22 @@ void Drm::setLowPriorityContextParam(uint32_t drmContextId) {
     UNRECOVERABLE_IF(retVal != 0);
 }

+void setNonPersistent(uint32_t drmContextId) {
+    drm_i915_gem_context_param gcp = {};
+    gcp.ctx_id = drmContextId;
+    gcp.param = 0xb; /* I915_CONTEXT_PARAM_PERSISTENCE; */
+
+    ioctl(DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &gcp);
+}
+
 uint32_t Drm::createDrmContext() {
     drm_i915_gem_context_create gcc = {};
     auto retVal = ioctl(DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &gcc);
     UNRECOVERABLE_IF(retVal != 0);

+    /* enable cleanup of resources on process termination */
+    setNonPersistent(gcc.ctx_id);
+
     return gcc.ctx_id;
 }

to interested parties?
-Chris

* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-07 16:51               ` Chris Wilson
@ 2019-08-07 17:12                 ` Bloomfield, Jon
  0 siblings, 0 replies; 25+ messages in thread
From: Bloomfield, Jon @ 2019-08-07 17:12 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Wednesday, August 7, 2019 9:51 AM
> To: Bloomfield, Jon <jon.bloomfield@intel.com>; intel-
> gfx@lists.freedesktop.org
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>; Winiarski, Michal
> <michal.winiarski@intel.com>
> Subject: RE: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> 
> Quoting Bloomfield, Jon (2019-08-07 16:29:55)
> > > -----Original Message-----
> > > From: Chris Wilson <chris@chris-wilson.co.uk>
> > > Sent: Wednesday, August 7, 2019 8:08 AM
> > > To: Bloomfield, Jon <jon.bloomfield@intel.com>; intel-
> > > gfx@lists.freedesktop.org
> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>; Winiarski, Michal
> > > <michal.winiarski@intel.com>
> > > Subject: RE: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> > >
> > > Quoting Bloomfield, Jon (2019-08-07 15:33:51)
> > > [skip to end]
> > > > We didn't explore the idea of terminating orphaned contexts though
> > > (where none of their resources are referenced by any other contexts). Is
> > > there a reason why this is not feasible? In the case of compute (certainly
> > > HPC) workloads, there would be no compositor taking the output so this
> > > might be a solution.
> > >
> > > Sounds easier said than done. We have to go through each request and
> > > determine if it has an external reference (or if the object holding the
> > > reference has an external reference) to see if the output would be
> > > visible to a third party. Sounds like a conservative GC :|
> > > (Coming to that conclusion suggests that we should structure the request
> > > tracking to make reparenting easier.)
> > >
> > > We could take a pid-1 approach and move all the orphan timelines over to
> > > a new parent purely responsible for them. That honestly doesn't seem to
> > > achieve anything. (We are still stuck with tasks on the GPU and no way
> > > to kill them.)
> > >
> > > In comparison, persistence is a rarely used "feature" and cleaning up on
> > > context close fits in nicely with the process model. It just works as
> > > most users/clients would expect. (Although running in non-persistent
> > > by default hasn't shown anything to explode on the desktop, it's too easy
> > > to construct scenarios where persistence turns out to be an advantage,
> > > particularly with chains of clients (the compositor model).) Between the
> > > two modes, we should have most bases covered, it's hard to argue for a
> > > third way (that is until someone has a usecase!)
> > > -Chris
> >
> > Ok, makes sense. Thanks.
> >
> > But have we converged on a decision :-)
> >
> > As I said, requiring compute umd optin should be ok for the immediate HPC
> issue, but I'd personally argue that it's valid to change the contract for
> hangcheck=0 and switch the default to non-persistent.
> 
> Could you tender
> 
> diff --git a/runtime/os_interface/linux/drm_neo.cpp
> b/runtime/os_interface/linux/drm_neo.cpp
> index 31deb68b..8a9af363 100644
> --- a/runtime/os_interface/linux/drm_neo.cpp
> +++ b/runtime/os_interface/linux/drm_neo.cpp
> @@ -141,11 +141,22 @@ void Drm::setLowPriorityContextParam(uint32_t
> drmContextId) {
>      UNRECOVERABLE_IF(retVal != 0);
>  }
> 
> +void setNonPersistent(uint32_t drmContextId) {
> +    drm_i915_gem_context_param gcp = {};
> +    gcp.ctx_id = drmContextId;
> +    gcp.param = 0xb; /* I915_CONTEXT_PARAM_PERSISTENCE; */
> +
> +    ioctl(DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &gcp);
> +}
> +
>  uint32_t Drm::createDrmContext() {
>      drm_i915_gem_context_create gcc = {};
>      auto retVal = ioctl(DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &gcc);
>      UNRECOVERABLE_IF(retVal != 0);
> 
> +    /* enable cleanup of resources on process termination */
> +    setNonPersistent(gcc.ctx_id);
> +
>      return gcc.ctx_id;
>  }
> 
> to interested parties?
> -Chris
Yes, that's exactly what I had in mind. I think it's enough to resolve the HPC challenges.

* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-06 13:47 ` [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close Chris Wilson
                     ` (2 preceding siblings ...)
  2019-08-07 13:22   ` Chris Wilson
@ 2019-08-09 23:34   ` Chris Wilson
  2019-08-12 14:39     ` Bloomfield, Jon
  3 siblings, 1 reply; 25+ messages in thread
From: Chris Wilson @ 2019-08-09 23:34 UTC (permalink / raw)
  To: intel-gfx

Quoting Chris Wilson (2019-08-06 14:47:25)
> Normally, we rely on our hangcheck to prevent persistent batches from
> hogging the GPU. However, if the user disables hangcheck, this mechanism
> breaks down. Despite our insistence that this is unsafe, the users are
> equally insistent that they want to use endless batches and will disable
> the hangcheck mechanism. We are looking at perhaps replacing hangcheck
> with a softer mechanism, that sends a pulse down the engine to check if
> it is well. We can use the same preemptive pulse to flush an active
> persistent context off the GPU upon context close, preventing resources
> being lost and unkillable requests remaining on the GPU, after process
> termination. To avoid changing the ABI and accidentally breaking
> existing userspace, we make the persistence of a context explicit and
> enable it by default. Userspace can opt out of persistent mode (forcing
> requests to be cancelled when the context is closed by process
> termination or explicitly) by a context parameter, or to facilitate
> existing use-cases by disabling hangcheck (i915.enable_hangcheck=0).
> (Note, one of the outcomes for supporting endless mode will be the
> removal of hangchecking, at which point opting into persistent mode will
> be mandatory, or maybe the default.)

For the record, I've finally run into examples of desktop clients
exiting before their rendering is shown. No longer hypothetical.
-Chris

* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-09 23:34   ` Chris Wilson
@ 2019-08-12 14:39     ` Bloomfield, Jon
  2019-08-12 14:51       ` Chris Wilson
  0 siblings, 1 reply; 25+ messages in thread
From: Bloomfield, Jon @ 2019-08-12 14:39 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Friday, August 9, 2019 4:35 PM
> To: intel-gfx@lists.freedesktop.org
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>; Winiarski, Michal
> <michal.winiarski@intel.com>; Bloomfield, Jon <jon.bloomfield@intel.com>
> Subject: Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> 
> Quoting Chris Wilson (2019-08-06 14:47:25)
> > Normally, we rely on our hangcheck to prevent persistent batches from
> > hogging the GPU. However, if the user disables hangcheck, this mechanism
> > breaks down. Despite our insistence that this is unsafe, the users are
> > equally insistent that they want to use endless batches and will disable
> > the hangcheck mechanism. We are looking at perhaps replacing hangcheck
> > with a softer mechanism, that sends a pulse down the engine to check if
> > it is well. We can use the same preemptive pulse to flush an active
> > persistent context off the GPU upon context close, preventing resources
> > being lost and unkillable requests remaining on the GPU, after process
> > termination. To avoid changing the ABI and accidentally breaking
> > existing userspace, we make the persistence of a context explicit and
> > enable it by default. Userspace can opt out of persistent mode (forcing
> > requests to be cancelled when the context is closed by process
> > termination or explicitly) by a context parameter, or to facilitate
> > existing use-cases by disabling hangcheck (i915.enable_hangcheck=0).
> > (Note, one of the outcomes for supporting endless mode will be the
> > removal of hangchecking, at which point opting into persistent mode will
> > be mandatory, or maybe the default.)
> 
> For the record, I've finally run into examples of desktop clients
> exiting before their rendering is shown. No longer hypothetical.
> -Chris

Can you share any details - Might be useful for testing.

* Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
  2019-08-12 14:39     ` Bloomfield, Jon
@ 2019-08-12 14:51       ` Chris Wilson
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Wilson @ 2019-08-12 14:51 UTC (permalink / raw)
  To: Bloomfield, Jon, intel-gfx

Quoting Bloomfield, Jon (2019-08-12 15:39:33)
> > -----Original Message-----
> > From: Chris Wilson <chris@chris-wilson.co.uk>
> > Sent: Friday, August 9, 2019 4:35 PM
> > To: intel-gfx@lists.freedesktop.org
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>; Winiarski, Michal
> > <michal.winiarski@intel.com>; Bloomfield, Jon <jon.bloomfield@intel.com>
> > Subject: Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> > 
> > Quoting Chris Wilson (2019-08-06 14:47:25)
> > > Normally, we rely on our hangcheck to prevent persistent batches from
> > > hogging the GPU. However, if the user disables hangcheck, this mechanism
> > > breaks down. Despite our insistence that this is unsafe, the users are
> > > equally insistent that they want to use endless batches and will disable
> > > the hangcheck mechanism. We are looking at perhaps replacing hangcheck
> > > with a softer mechanism, that sends a pulse down the engine to check if
> > > it is well. We can use the same preemptive pulse to flush an active
> > > persistent context off the GPU upon context close, preventing resources
> > > being lost and unkillable requests remaining on the GPU, after process
> > > termination. To avoid changing the ABI and accidentally breaking
> > > existing userspace, we make the persistence of a context explicit and
> > > enable it by default. Userspace can opt out of persistent mode (forcing
> > > requests to be cancelled when the context is closed by process
> > > termination or explicitly) by a context parameter, or to facilitate
> > > existing use-cases by disabling hangcheck (i915.enable_hangcheck=0).
> > > (Note, one of the outcomes for supporting endless mode will be the
> > > removal of hangchecking, at which point opting into persistent mode will
> > > be mandatory, or maybe the default.)
> > 
> > For the record, I've finally run into examples of desktop clients
> > exiting before their rendering is shown. No longer hypothetical.
> > -Chris
> 
> Can you share any details - Might be useful for testing.

In cinnamon, the alt-tab switcher seems to be one. Another one seems to
be around firefox, but I haven't established the trigger. I should
actually log who owned the context!
-Chris

end of thread, other threads:[~2019-08-12 14:51 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-06 13:47 [PATCH 1/5] drm/i915: Only enqueue already completed requests Chris Wilson
2019-08-06 13:47 ` [PATCH 2/5] drm/i915/execlists: Force preemption Chris Wilson
2019-08-06 13:47 ` [PATCH 3/5] drm/i915: Mark up "sentinel" requests Chris Wilson
2019-08-06 14:29   ` Mika Kuoppala
2019-08-06 14:47     ` Chris Wilson
2019-08-06 13:47 ` [PATCH 4/5] drm/i915/execlists: Cancel banned contexts on schedule-out Chris Wilson
2019-08-06 13:47 ` [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close Chris Wilson
2019-08-06 14:11   ` Chris Wilson
2019-08-06 14:26   ` Bloomfield, Jon
2019-08-06 14:41     ` Chris Wilson
2019-08-07 13:22   ` Chris Wilson
2019-08-07 14:04     ` Bloomfield, Jon
2019-08-07 14:14       ` Chris Wilson
2019-08-07 14:33         ` Bloomfield, Jon
2019-08-07 15:08           ` Chris Wilson
2019-08-07 15:29             ` Bloomfield, Jon
2019-08-07 15:38               ` Chris Wilson
2019-08-07 16:51               ` Chris Wilson
2019-08-07 17:12                 ` Bloomfield, Jon
2019-08-09 23:34   ` Chris Wilson
2019-08-12 14:39     ` Bloomfield, Jon
2019-08-12 14:51       ` Chris Wilson
2019-08-06 14:25 ` [PATCH 1/5] drm/i915: Only enqueue already completed requests Mika Kuoppala
2019-08-06 14:44   ` Chris Wilson
2019-08-06 14:51 ` ✗ Fi.CI.BAT: failure for series starting with [1/5] " Patchwork
