* [RFC 0/8] Force preemption
@ 2018-03-21 17:26 jeff.mcgee
  2018-03-21 17:26 ` [RFC 1/8] drm/i915/execlists: Refactor out complete_preempt_context() jeff.mcgee
                   ` (9 more replies)
  0 siblings, 10 replies; 29+ messages in thread
From: jeff.mcgee @ 2018-03-21 17:26 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

Force preemption uses engine reset to enforce a limit on the time
that a request targeted for preemption can block. This feature is
a requirement in automotive systems where the GPU may be shared by
clients of critically high priority and clients of low priority that
may not have been curated to be preemption-friendly. There may be
more general applications of this feature. I'm sharing it as an RFC
to stimulate that discussion and also to get any technical feedback
that I can before submitting to the product kernel that needs this.
I have developed the patches for ease of rebase, given that this is
for the moment considered a non-upstreamable feature. It would be
possible to refactor hangcheck to fully incorporate force preemption
as another tier of patience (or impatience) with the running request.
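
Conceptually, the policy described above is a per-preemption watchdog:
wait up to a fixed budget for the running request to yield, and past
that budget fall back to engine reset. A minimal user-space sketch of
that decision, with all names illustrative rather than the actual i915
API:

```c
/* Illustrative sketch only: models the force-preemption policy, not
 * the real i915 implementation. A request targeted for preemption may
 * block the preempting client for at most limit_ms; once the budget
 * is exhausted, the engine is reset so the high-priority client is
 * guaranteed to make progress. */
enum fp_action { FP_WAIT, FP_RESET_ENGINE };

static enum fp_action force_preempt_tick(unsigned int pending_ms,
					 unsigned int limit_ms)
{
	/* Still within budget: keep waiting for the context switch. */
	if (pending_ms < limit_ms)
		return FP_WAIT;

	/* Budget exhausted: enforce the limit via engine reset. */
	return FP_RESET_ENGINE;
}
```

The series wires this decision into the existing reset machinery
rather than adding a separate path, which is why the later patches
rework the reset prepare/finish hooks first.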

Chris Wilson (5):
  drm/i915/execlists: Refactor out complete_preempt_context()
  drm/i915: Add control flags to i915_handle_error()
  drm/i915: Move engine reset prepare/finish to backends
  drm/i915: Split execlists/guc reset preparations
  drm/i915/execlists: Flush pending preemption events during reset

Jeff McGee (3):
  drm/i915: Fix loop on CSB processing
  drm/i915: Skip CSB processing on invalid CSB tail
  drm/i915: Force preemption to complete via engine reset

 drivers/gpu/drm/i915/i915_debugfs.c              |   4 +-
 drivers/gpu/drm/i915/i915_drv.c                  |  17 +-
 drivers/gpu/drm/i915/i915_drv.h                  |  10 +-
 drivers/gpu/drm/i915/i915_gem.c                  |  69 ++--
 drivers/gpu/drm/i915/i915_gpu_error.h            |   3 +
 drivers/gpu/drm/i915/i915_irq.c                  |  55 +--
 drivers/gpu/drm/i915/i915_params.c               |   3 +
 drivers/gpu/drm/i915/i915_params.h               |   1 +
 drivers/gpu/drm/i915/i915_request.c              |   2 +-
 drivers/gpu/drm/i915/intel_engine_cs.c           |  40 +++
 drivers/gpu/drm/i915/intel_guc_submission.c      |  39 ++
 drivers/gpu/drm/i915/intel_hangcheck.c           |   8 +-
 drivers/gpu/drm/i915/intel_lrc.c                 | 436 ++++++++++++++---------
 drivers/gpu/drm/i915/intel_ringbuffer.c          |  20 +-
 drivers/gpu/drm/i915/intel_ringbuffer.h          |  13 +-
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c |  13 +-
 16 files changed, 469 insertions(+), 264 deletions(-)

-- 
2.16.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

* [RFC 1/8] drm/i915/execlists: Refactor out complete_preempt_context()
  2018-03-21 17:26 [RFC 0/8] Force preemption jeff.mcgee
@ 2018-03-21 17:26 ` jeff.mcgee
  2018-03-21 17:26 ` [RFC 2/8] drm/i915: Add control flags to i915_handle_error() jeff.mcgee
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: jeff.mcgee @ 2018-03-21 17:26 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Chris Wilson <chris@chris-wilson.co.uk>

As a complement to inject_preempt_context(), follow up with the function
to handle its completion. This will be useful should we wish to extend
the duties of the preempt-context for execlists.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jeff McGee <jeff.mcgee@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 53f1c009ed7b..0bfaeb56b8c7 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -531,8 +531,17 @@ static void inject_preempt_context(struct intel_engine_cs *engine)
 	if (execlists->ctrl_reg)
 		writel(EL_CTRL_LOAD, execlists->ctrl_reg);
 
-	execlists_clear_active(&engine->execlists, EXECLISTS_ACTIVE_HWACK);
-	execlists_set_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT);
+	execlists_clear_active(execlists, EXECLISTS_ACTIVE_HWACK);
+	execlists_set_active(execlists, EXECLISTS_ACTIVE_PREEMPT);
+}
+
+static void complete_preempt_context(struct intel_engine_execlists *execlists)
+{
+	execlists_cancel_port_requests(execlists);
+	execlists_unwind_incomplete_requests(execlists);
+
+	GEM_BUG_ON(!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT));
+	execlists_clear_active(execlists, EXECLISTS_ACTIVE_PREEMPT);
 }
 
 static void execlists_dequeue(struct intel_engine_cs *engine)
@@ -939,14 +948,7 @@ static void execlists_submission_tasklet(unsigned long data)
 			if (status & GEN8_CTX_STATUS_COMPLETE &&
 			    buf[2*head + 1] == execlists->preempt_complete_status) {
 				GEM_TRACE("%s preempt-idle\n", engine->name);
-
-				execlists_cancel_port_requests(execlists);
-				execlists_unwind_incomplete_requests(execlists);
-
-				GEM_BUG_ON(!execlists_is_active(execlists,
-								EXECLISTS_ACTIVE_PREEMPT));
-				execlists_clear_active(execlists,
-						       EXECLISTS_ACTIVE_PREEMPT);
+				complete_preempt_context(execlists);
 				continue;
 			}
 
-- 
2.16.2

* [RFC 2/8] drm/i915: Add control flags to i915_handle_error()
  2018-03-21 17:26 [RFC 0/8] Force preemption jeff.mcgee
  2018-03-21 17:26 ` [RFC 1/8] drm/i915/execlists: Refactor out complete_preempt_context() jeff.mcgee
@ 2018-03-21 17:26 ` jeff.mcgee
  2018-03-21 17:26 ` [RFC 3/8] drm/i915: Move engine reset prepare/finish to backends jeff.mcgee
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: jeff.mcgee @ 2018-03-21 17:26 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally, Mika Kuoppala

From: Chris Wilson <chris@chris-wilson.co.uk>

Not all callers want the GPU error to be handled in the same way, so
expose a control parameter. In the first instance, some callers do not
want the heavyweight error capture, so add a bit to request that the
state be captured and saved.

v2: Pass msg down to i915_reset/i915_reset_engine so that we include the
reason for the reset in the dev_notice(), superseding the earlier option
to not print that notice.
v3: Stash the reason inside i915->gpu_error to hand over to the direct
reset from the blocking waiter.
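
The control flag is the usual opt-in bitmask pattern: the expensive
capture path runs only when the caller sets I915_ERROR_CAPTURE. A
reduced, compilable sketch of the shape (stub bodies and a counter
standing in for the real capture work, not the driver code itself):

```c
/* Sketch of the flag-gated capture this patch introduces. The flag
 * name matches the patch; everything else is a stand-in. */
#define I915_ERROR_CAPTURE (1UL << 0)

static int capture_count;	/* counts how often the heavy path ran */

static void capture_error_state(void)
{
	capture_count++;	/* heavyweight error capture elided */
}

static void handle_error(unsigned long flags)
{
	/* Only callers that opted in pay for state capture. */
	if (flags & I915_ERROR_CAPTURE)
		capture_error_state();

	/* ...engine or device reset proceeds either way... */
}
```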

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jeff McGee <jeff.mcgee@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c              |  4 +-
 drivers/gpu/drm/i915/i915_drv.c                  | 17 ++++----
 drivers/gpu/drm/i915/i915_drv.h                  | 10 ++---
 drivers/gpu/drm/i915/i915_gpu_error.h            |  3 ++
 drivers/gpu/drm/i915/i915_irq.c                  | 55 ++++++++++++++----------
 drivers/gpu/drm/i915/i915_request.c              |  2 +-
 drivers/gpu/drm/i915/intel_hangcheck.c           |  8 ++--
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 13 +++---
 8 files changed, 60 insertions(+), 52 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 964ea1a12357..7816cd53100a 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -4011,8 +4011,8 @@ i915_wedged_set(void *data, u64 val)
 		engine->hangcheck.stalled = true;
 	}
 
-	i915_handle_error(i915, val, "Manually set wedged engine mask = %llx",
-			  val);
+	i915_handle_error(i915, val, I915_ERROR_CAPTURE,
+			  "Manually set wedged engine mask = %llx", val);
 
 	wait_on_bit(&i915->gpu_error.flags,
 		    I915_RESET_HANDOFF,
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 1021bf40e236..6b04cc3be513 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1869,7 +1869,6 @@ static int i915_resume_switcheroo(struct drm_device *dev)
 /**
  * i915_reset - reset chip after a hang
  * @i915: #drm_i915_private to reset
- * @flags: Instructions
  *
  * Reset the chip.  Useful if a hang is detected. Marks the device as wedged
  * on failure.
@@ -1884,7 +1883,7 @@ static int i915_resume_switcheroo(struct drm_device *dev)
  *   - re-init interrupt state
  *   - re-init display
  */
-void i915_reset(struct drm_i915_private *i915, unsigned int flags)
+void i915_reset(struct drm_i915_private *i915)
 {
 	struct i915_gpu_error *error = &i915->gpu_error;
 	int ret;
@@ -1901,8 +1900,9 @@ void i915_reset(struct drm_i915_private *i915, unsigned int flags)
 	if (!i915_gem_unset_wedged(i915))
 		goto wakeup;
 
-	if (!(flags & I915_RESET_QUIET))
-		dev_notice(i915->drm.dev, "Resetting chip after gpu hang\n");
+	if (error->reason)
+		dev_notice(i915->drm.dev,
+			   "Resetting chip for %s\n", error->reason);
 	error->reset_count++;
 
 	disable_irq(i915->drm.irq);
@@ -2003,7 +2003,7 @@ static inline int intel_gt_reset_engine(struct drm_i915_private *dev_priv,
 /**
  * i915_reset_engine - reset GPU engine to recover from a hang
  * @engine: engine to reset
- * @flags: options
+ * @msg: reason for GPU reset; or NULL for no dev_notice()
  *
  * Reset a specific GPU engine. Useful if a hang is detected.
  * Returns zero on successful reset or otherwise an error code.
@@ -2013,7 +2013,7 @@ static inline int intel_gt_reset_engine(struct drm_i915_private *dev_priv,
  *  - reset engine (which will force the engine to idle)
  *  - re-init/configure engine
  */
-int i915_reset_engine(struct intel_engine_cs *engine, unsigned int flags)
+int i915_reset_engine(struct intel_engine_cs *engine, const char *msg)
 {
 	struct i915_gpu_error *error = &engine->i915->gpu_error;
 	struct i915_request *active_request;
@@ -2028,10 +2028,9 @@ int i915_reset_engine(struct intel_engine_cs *engine, unsigned int flags)
 		goto out;
 	}
 
-	if (!(flags & I915_RESET_QUIET)) {
+	if (msg)
 		dev_notice(engine->i915->drm.dev,
-			   "Resetting %s after gpu hang\n", engine->name);
-	}
+			   "Resetting %s for %s\n", engine->name, msg);
 	error->reset_engine_count[engine->id]++;
 
 	if (!engine->i915->guc.execbuf_client)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e27ba8fb64e6..c9c3b2ba6a86 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2700,10 +2700,8 @@ extern void i915_driver_unload(struct drm_device *dev);
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 
-#define I915_RESET_QUIET BIT(0)
-extern void i915_reset(struct drm_i915_private *i915, unsigned int flags);
-extern int i915_reset_engine(struct intel_engine_cs *engine,
-			     unsigned int flags);
+extern void i915_reset(struct drm_i915_private *i915);
+extern int i915_reset_engine(struct intel_engine_cs *engine, const char *msg);
 
 extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_reset_guc(struct drm_i915_private *dev_priv);
@@ -2751,10 +2749,12 @@ static inline void i915_queue_hangcheck(struct drm_i915_private *dev_priv)
 			   &dev_priv->gpu_error.hangcheck_work, delay);
 }
 
-__printf(3, 4)
+__printf(4, 5)
 void i915_handle_error(struct drm_i915_private *dev_priv,
 		       u32 engine_mask,
+		       unsigned long flags,
 		       const char *fmt, ...);
+#define I915_ERROR_CAPTURE BIT(0)
 
 extern void intel_irq_init(struct drm_i915_private *dev_priv);
 extern void intel_irq_fini(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index ebbdf37e2879..ac5760673cc9 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -269,6 +269,9 @@ struct i915_gpu_error {
 	/** Number of times an engine has been reset */
 	u32 reset_engine_count[I915_NUM_ENGINES];
 
+	/** Reason for the current *global* reset */
+	const char *reason;
+
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
 	 * to release the struct_mutex for the reset to procede.
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 44eef355e12c..fa7310766217 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2877,15 +2877,10 @@ static irqreturn_t gen11_irq_handler(int irq, void *arg)
 	return IRQ_HANDLED;
 }
 
-/**
- * i915_reset_device - do process context error handling work
- * @dev_priv: i915 device private
- *
- * Fire an error uevent so userspace can see that a hang or error
- * was detected.
- */
-static void i915_reset_device(struct drm_i915_private *dev_priv)
+static void i915_reset_device(struct drm_i915_private *dev_priv,
+			      const char *msg)
 {
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
 	struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj;
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
 	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
@@ -2901,29 +2896,32 @@ static void i915_reset_device(struct drm_i915_private *dev_priv)
 	i915_wedge_on_timeout(&w, dev_priv, 5*HZ) {
 		intel_prepare_reset(dev_priv);
 
+		error->reason = msg;
+
 		/* Signal that locked waiters should reset the GPU */
-		set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
-		wake_up_all(&dev_priv->gpu_error.wait_queue);
+		set_bit(I915_RESET_HANDOFF, &error->flags);
+		wake_up_all(&error->wait_queue);
 
 		/* Wait for anyone holding the lock to wakeup, without
 		 * blocking indefinitely on struct_mutex.
 		 */
 		do {
 			if (mutex_trylock(&dev_priv->drm.struct_mutex)) {
-				i915_reset(dev_priv, 0);
+				i915_reset(dev_priv);
 				mutex_unlock(&dev_priv->drm.struct_mutex);
 			}
-		} while (wait_on_bit_timeout(&dev_priv->gpu_error.flags,
+		} while (wait_on_bit_timeout(&error->flags,
 					     I915_RESET_HANDOFF,
 					     TASK_UNINTERRUPTIBLE,
 					     1));
 
+		error->reason = NULL;
+
 		intel_finish_reset(dev_priv);
 	}
 
-	if (!test_bit(I915_WEDGED, &dev_priv->gpu_error.flags))
-		kobject_uevent_env(kobj,
-				   KOBJ_CHANGE, reset_done_event);
+	if (!test_bit(I915_WEDGED, &error->flags))
+		kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
 }
 
 static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
@@ -2955,6 +2953,7 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
  * i915_handle_error - handle a gpu error
  * @dev_priv: i915 device private
  * @engine_mask: mask representing engines that are hung
+ * @flags: control flags
  * @fmt: Error message format string
  *
  * Do some basic checking of register state at error time and
@@ -2965,16 +2964,23 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
  */
 void i915_handle_error(struct drm_i915_private *dev_priv,
 		       u32 engine_mask,
+		       unsigned long flags,
 		       const char *fmt, ...)
 {
 	struct intel_engine_cs *engine;
 	unsigned int tmp;
-	va_list args;
 	char error_msg[80];
+	char *msg = NULL;
 
-	va_start(args, fmt);
-	vscnprintf(error_msg, sizeof(error_msg), fmt, args);
-	va_end(args);
+	if (fmt) {
+		va_list args;
+
+		va_start(args, fmt);
+		vscnprintf(error_msg, sizeof(error_msg), fmt, args);
+		va_end(args);
+
+		msg = error_msg;
+	}
 
 	/*
 	 * In most cases it's guaranteed that we get here with an RPM
@@ -2986,8 +2992,11 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	intel_runtime_pm_get(dev_priv);
 
 	engine_mask &= INTEL_INFO(dev_priv)->ring_mask;
-	i915_capture_error_state(dev_priv, engine_mask, error_msg);
-	i915_clear_error_registers(dev_priv);
+
+	if (flags & I915_ERROR_CAPTURE) {
+		i915_capture_error_state(dev_priv, engine_mask, msg);
+		i915_clear_error_registers(dev_priv);
+	}
 
 	/*
 	 * Try engine reset when available. We fall back to full reset if
@@ -3000,7 +3009,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 					     &dev_priv->gpu_error.flags))
 				continue;
 
-			if (i915_reset_engine(engine, 0) == 0)
+			if (i915_reset_engine(engine, msg) == 0)
 				engine_mask &= ~intel_engine_flag(engine);
 
 			clear_bit(I915_RESET_ENGINE + engine->id,
@@ -3030,7 +3039,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 				    TASK_UNINTERRUPTIBLE);
 	}
 
-	i915_reset_device(dev_priv);
+	i915_reset_device(dev_priv, msg);
 
 	for_each_engine(engine, dev_priv, tmp) {
 		clear_bit(I915_RESET_ENGINE + engine->id,
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 43c7134a9b93..2325886d1d55 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1229,7 +1229,7 @@ static bool __i915_wait_request_check_and_reset(struct i915_request *request)
 		return false;
 
 	__set_current_state(TASK_RUNNING);
-	i915_reset(request->i915, 0);
+	i915_reset(request->i915);
 	return true;
 }
 
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index 42e45ae87393..e7776fbd9ebf 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -246,7 +246,7 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
 	 */
 	tmp = I915_READ_CTL(engine);
 	if (tmp & RING_WAIT) {
-		i915_handle_error(dev_priv, 0,
+		i915_handle_error(dev_priv, 0, 0,
 				  "Kicking stuck wait on %s",
 				  engine->name);
 		I915_WRITE_CTL(engine, tmp);
@@ -258,7 +258,7 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
 		default:
 			return ENGINE_DEAD;
 		case 1:
-			i915_handle_error(dev_priv, 0,
+			i915_handle_error(dev_priv, 0, 0,
 					  "Kicking stuck semaphore on %s",
 					  engine->name);
 			I915_WRITE_CTL(engine, tmp);
@@ -386,13 +386,13 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
 	if (stuck != hung)
 		hung &= ~stuck;
 	len = scnprintf(msg, sizeof(msg),
-			"%s on ", stuck == hung ? "No progress" : "Hang");
+			"%s on ", stuck == hung ? "no progress" : "hang");
 	for_each_engine_masked(engine, i915, hung, tmp)
 		len += scnprintf(msg + len, sizeof(msg) - len,
 				 "%s, ", engine->name);
 	msg[len-2] = '\0';
 
-	return i915_handle_error(i915, hung, "%s", msg);
+	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
 }
 
 /*
diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index df7898c8edcb..4372826998aa 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -433,7 +433,7 @@ static int igt_global_reset(void *arg)
 	mutex_lock(&i915->drm.struct_mutex);
 	reset_count = i915_reset_count(&i915->gpu_error);
 
-	i915_reset(i915, I915_RESET_QUIET);
+	i915_reset(i915);
 
 	if (i915_reset_count(&i915->gpu_error) == reset_count) {
 		pr_err("No GPU reset recorded!\n");
@@ -518,7 +518,7 @@ static int __igt_reset_engine(struct drm_i915_private *i915, bool active)
 			engine->hangcheck.seqno =
 				intel_engine_get_seqno(engine);
 
-			err = i915_reset_engine(engine, I915_RESET_QUIET);
+			err = i915_reset_engine(engine, NULL);
 			if (err) {
 				pr_err("i915_reset_engine failed\n");
 				break;
@@ -725,7 +725,7 @@ static int __igt_reset_engine_others(struct drm_i915_private *i915,
 			engine->hangcheck.seqno =
 				intel_engine_get_seqno(engine);
 
-			err = i915_reset_engine(engine, I915_RESET_QUIET);
+			err = i915_reset_engine(engine, NULL);
 			if (err) {
 				pr_err("i915_reset_engine(%s:%s) failed, err=%d\n",
 				       engine->name, active ? "active" : "idle", err);
@@ -865,7 +865,6 @@ static int igt_wait_reset(void *arg)
 		       __func__, rq->fence.seqno, hws_seqno(&h, rq));
 		intel_engine_dump(rq->engine, &p, "%s\n", rq->engine->name);
 
-		i915_reset(i915, 0);
 		i915_gem_set_wedged(i915);
 
 		err = -EIO;
@@ -962,7 +961,6 @@ static int igt_reset_queue(void *arg)
 				i915_request_put(rq);
 				i915_request_put(prev);
 
-				i915_reset(i915, 0);
 				i915_gem_set_wedged(i915);
 
 				err = -EIO;
@@ -971,7 +969,7 @@ static int igt_reset_queue(void *arg)
 
 			reset_count = fake_hangcheck(prev);
 
-			i915_reset(i915, I915_RESET_QUIET);
+			i915_reset(i915);
 
 			GEM_BUG_ON(test_bit(I915_RESET_HANDOFF,
 					    &i915->gpu_error.flags));
@@ -1069,7 +1067,6 @@ static int igt_handle_error(void *arg)
 		       __func__, rq->fence.seqno, hws_seqno(&h, rq));
 		intel_engine_dump(rq->engine, &p, "%s\n", rq->engine->name);
 
-		i915_reset(i915, 0);
 		i915_gem_set_wedged(i915);
 
 		err = -EIO;
@@ -1084,7 +1081,7 @@ static int igt_handle_error(void *arg)
 	engine->hangcheck.stalled = true;
 	engine->hangcheck.seqno = intel_engine_get_seqno(engine);
 
-	i915_handle_error(i915, intel_engine_flag(engine), "%s", __func__);
+	i915_handle_error(i915, intel_engine_flag(engine), 0, NULL);
 
 	xchg(&i915->gpu_error.first_error, error);
 
-- 
2.16.2

* [RFC 3/8] drm/i915: Move engine reset prepare/finish to backends
  2018-03-21 17:26 [RFC 0/8] Force preemption jeff.mcgee
  2018-03-21 17:26 ` [RFC 1/8] drm/i915/execlists: Refactor out complete_preempt_context() jeff.mcgee
  2018-03-21 17:26 ` [RFC 2/8] drm/i915: Add control flags to i915_handle_error() jeff.mcgee
@ 2018-03-21 17:26 ` jeff.mcgee
  2018-03-21 17:26 ` [RFC 4/8] drm/i915: Split execlists/guc reset preparations jeff.mcgee
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: jeff.mcgee @ 2018-03-21 17:26 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Chris Wilson <chris@chris-wilson.co.uk>

In preparation for handling incomplete preemption during reset more
carefully in execlists, we move the existing code wholesale to the
backends under a couple of new reset vfuncs.
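
The prepare/reset/finish split is an ordinary table-of-callbacks
refactor: the core reset path drives three stages without knowing
whether the backend is the legacy ringbuffer, execlists, or GuC. A
toy, compilable sketch of the shape (stubbed types and illustrative
backend names, mirroring the `engine->reset` struct in the diff):

```c
#include <stddef.h>

struct i915_request;	/* opaque for this sketch */

struct engine {
	struct {
		struct i915_request *(*prepare)(struct engine *e);
		void (*reset)(struct engine *e, struct i915_request *rq);
		void (*finish)(struct engine *e);
	} reset;
	int stage;	/* toy state to observe call ordering */
};

/* Backend-agnostic core sequence, as in i915_gem_reset_*(). */
static void core_engine_reset(struct engine *e)
{
	struct i915_request *rq = e->reset.prepare(e);

	e->reset.reset(e, rq);
	e->reset.finish(e);
}

/* A trivial "backend" that just records which stage ran last. */
static struct i915_request *toy_prepare(struct engine *e)
{
	e->stage = 1;
	return NULL;	/* no active request in this sketch */
}

static void toy_reset(struct engine *e, struct i915_request *rq)
{
	(void)rq;
	e->stage = 2;
}

static void toy_finish(struct engine *e)
{
	e->stage = 3;
}
```

Each real backend then fills in only the stages it needs, which is
what lets the next patch give execlists a preemption-aware prepare
step while GuC keeps its own.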

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michał Winiarski <michal.winiarski@intel.com>
CC: Michel Thierry <michel.thierry@intel.com>
Cc: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c         | 42 ++++----------------------
 drivers/gpu/drm/i915/intel_lrc.c        | 52 +++++++++++++++++++++++++++++++--
 drivers/gpu/drm/i915/intel_ringbuffer.c | 20 +++++++++++--
 drivers/gpu/drm/i915/intel_ringbuffer.h |  9 ++++--
 4 files changed, 78 insertions(+), 45 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 802df8e1a544..38f7160d99c9 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2917,7 +2917,7 @@ static bool engine_stalled(struct intel_engine_cs *engine)
 struct i915_request *
 i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 {
-	struct i915_request *request = NULL;
+	struct i915_request *request;
 
 	/*
 	 * During the reset sequence, we must prevent the engine from
@@ -2940,40 +2940,7 @@ i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 	 */
 	kthread_park(engine->breadcrumbs.signaler);
 
-	/*
-	 * Prevent request submission to the hardware until we have
-	 * completed the reset in i915_gem_reset_finish(). If a request
-	 * is completed by one engine, it may then queue a request
-	 * to a second via its execlists->tasklet *just* as we are
-	 * calling engine->init_hw() and also writing the ELSP.
-	 * Turning off the execlists->tasklet until the reset is over
-	 * prevents the race.
-	 *
-	 * Note that this needs to be a single atomic operation on the
-	 * tasklet (flush existing tasks, prevent new tasks) to prevent
-	 * a race between reset and set-wedged. It is not, so we do the best
-	 * we can atm and make sure we don't lock the machine up in the more
-	 * common case of recursively being called from set-wedged from inside
-	 * i915_reset.
-	 */
-	if (!atomic_read(&engine->execlists.tasklet.count))
-		tasklet_kill(&engine->execlists.tasklet);
-	tasklet_disable(&engine->execlists.tasklet);
-
-	/*
-	 * We're using worker to queue preemption requests from the tasklet in
-	 * GuC submission mode.
-	 * Even though tasklet was disabled, we may still have a worker queued.
-	 * Let's make sure that all workers scheduled before disabling the
-	 * tasklet are completed before continuing with the reset.
-	 */
-	if (engine->i915->guc.preempt_wq)
-		flush_workqueue(engine->i915->guc.preempt_wq);
-
-	if (engine->irq_seqno_barrier)
-		engine->irq_seqno_barrier(engine);
-
-	request = i915_gem_find_active_request(engine);
+	request = engine->reset.prepare(engine);
 	if (request && request->fence.error == -EIO)
 		request = ERR_PTR(-EIO); /* Previous reset failed! */
 
@@ -3120,7 +3087,7 @@ void i915_gem_reset_engine(struct intel_engine_cs *engine,
 	}
 
 	/* Setup the CS to resume from the breadcrumb of the hung request */
-	engine->reset_hw(engine, request);
+	engine->reset.reset(engine, request);
 }
 
 void i915_gem_reset(struct drm_i915_private *dev_priv)
@@ -3172,7 +3139,8 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
 
 void i915_gem_reset_finish_engine(struct intel_engine_cs *engine)
 {
-	tasklet_enable(&engine->execlists.tasklet);
+	engine->reset.finish(engine);
+
 	kthread_unpark(engine->breadcrumbs.signaler);
 
 	intel_uncore_forcewake_put(engine->i915, FORCEWAKE_ALL);
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 0bfaeb56b8c7..f662a9524233 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1663,6 +1663,44 @@ static int gen9_init_render_ring(struct intel_engine_cs *engine)
 	return init_workarounds_ring(engine);
 }
 
+static struct i915_request *
+execlists_reset_prepare(struct intel_engine_cs *engine)
+{
+	struct intel_engine_execlists * const execlists = &engine->execlists;
+
+	/*
+	 * Prevent request submission to the hardware until we have
+	 * completed the reset in i915_gem_reset_finish(). If a request
+	 * is completed by one engine, it may then queue a request
+	 * to a second via its execlists->tasklet *just* as we are
+	 * calling engine->init_hw() and also writing the ELSP.
+	 * Turning off the execlists->tasklet until the reset is over
+	 * prevents the race.
+	 *
+	 * Note that this needs to be a single atomic operation on the
+	 * tasklet (flush existing tasks, prevent new tasks) to prevent
+	 * a race between reset and set-wedged. It is not, so we do the best
+	 * we can atm and make sure we don't lock the machine up in the more
+	 * common case of recursively being called from set-wedged from inside
+	 * i915_reset.
+	 */
+	if (!atomic_read(&execlists->tasklet.count))
+		tasklet_kill(&execlists->tasklet);
+	tasklet_disable(&execlists->tasklet);
+
+	/*
+	 * We're using worker to queue preemption requests from the tasklet in
+	 * GuC submission mode.
+	 * Even though tasklet was disabled, we may still have a worker queued.
+	 * Let's make sure that all workers scheduled before disabling the
+	 * tasklet are completed before continuing with the reset.
+	 */
+	if (engine->i915->guc.preempt_wq)
+		flush_workqueue(engine->i915->guc.preempt_wq);
+
+	return i915_gem_find_active_request(engine);
+}
+
 static void reset_irq(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
@@ -1692,8 +1730,8 @@ static void reset_irq(struct intel_engine_cs *engine)
 	clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
 }
 
-static void reset_common_ring(struct intel_engine_cs *engine,
-			      struct i915_request *request)
+static void execlists_reset(struct intel_engine_cs *engine,
+			    struct i915_request *request)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
 	struct intel_context *ce;
@@ -1766,6 +1804,11 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 	unwind_wa_tail(request);
 }
 
+static void execlists_reset_finish(struct intel_engine_cs *engine)
+{
+	tasklet_enable(&engine->execlists.tasklet);
+}
+
 static int intel_logical_ring_emit_pdps(struct i915_request *rq)
 {
 	struct i915_hw_ppgtt *ppgtt = rq->ctx->ppgtt;
@@ -2090,7 +2133,10 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
 {
 	/* Default vfuncs which can be overriden by each engine. */
 	engine->init_hw = gen8_init_common_ring;
-	engine->reset_hw = reset_common_ring;
+
+	engine->reset.prepare = execlists_reset_prepare;
+	engine->reset.reset = execlists_reset;
+	engine->reset.finish = execlists_reset_finish;
 
 	engine->context_pin = execlists_context_pin;
 	engine->context_unpin = execlists_context_unpin;
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 04d9d9a946a7..eebcc877ef60 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -530,8 +530,16 @@ static int init_ring_common(struct intel_engine_cs *engine)
 	return ret;
 }
 
-static void reset_ring_common(struct intel_engine_cs *engine,
-			      struct i915_request *request)
+static struct i915_request *reset_prepare(struct intel_engine_cs *engine)
+{
+	if (engine->irq_seqno_barrier)
+		engine->irq_seqno_barrier(engine);
+
+	return i915_gem_find_active_request(engine);
+}
+
+static void reset_ring(struct intel_engine_cs *engine,
+		       struct i915_request *request)
 {
 	/*
 	 * RC6 must be prevented until the reset is complete and the engine
@@ -595,6 +603,10 @@ static void reset_ring_common(struct intel_engine_cs *engine,
 	}
 }
 
+static void reset_finish(struct intel_engine_cs *engine)
+{
+}
+
 static int intel_rcs_ctx_init(struct i915_request *rq)
 {
 	int ret;
@@ -1987,7 +1999,9 @@ static void intel_ring_default_vfuncs(struct drm_i915_private *dev_priv,
 	intel_ring_init_semaphores(dev_priv, engine);
 
 	engine->init_hw = init_ring_common;
-	engine->reset_hw = reset_ring_common;
+	engine->reset.prepare = reset_prepare;
+	engine->reset.reset = reset_ring;
+	engine->reset.finish = reset_finish;
 
 	engine->context_pin = intel_ring_context_pin;
 	engine->context_unpin = intel_ring_context_unpin;
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 1f50727a5ddb..e2681303ce21 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -418,8 +418,13 @@ struct intel_engine_cs {
 	void		(*irq_disable)(struct intel_engine_cs *engine);
 
 	int		(*init_hw)(struct intel_engine_cs *engine);
-	void		(*reset_hw)(struct intel_engine_cs *engine,
-				    struct i915_request *rq);
+
+	struct {
+		struct i915_request *(*prepare)(struct intel_engine_cs *engine);
+		void (*reset)(struct intel_engine_cs *engine,
+			      struct i915_request *rq);
+		void (*finish)(struct intel_engine_cs *engine);
+	} reset;
 
 	void		(*park)(struct intel_engine_cs *engine);
 	void		(*unpark)(struct intel_engine_cs *engine);
-- 
2.16.2

* [RFC 4/8] drm/i915: Split execlists/guc reset preparations
  2018-03-21 17:26 [RFC 0/8] Force preemption jeff.mcgee
                   ` (2 preceding siblings ...)
  2018-03-21 17:26 ` [RFC 3/8] drm/i915: Move engine reset prepare/finish to backends jeff.mcgee
@ 2018-03-21 17:26 ` jeff.mcgee
  2018-03-21 17:26 ` [RFC 5/8] drm/i915/execlists: Flush pending preemption events during reset jeff.mcgee
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: jeff.mcgee @ 2018-03-21 17:26 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Chris Wilson <chris@chris-wilson.co.uk>

In the next patch, we will make the execlists reset prepare callback
take into account preemption by flushing the context-switch handler.
This is not applicable to the GuC submission backend, so split the two
into their own backend callbacks.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michał Winiarski <michal.winiarski@intel.com>
CC: Michel Thierry <michel.thierry@intel.com>
Cc: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_guc_submission.c | 39 +++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.c            | 11 +-------
 2 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_guc_submission.c b/drivers/gpu/drm/i915/intel_guc_submission.c
index 207cda062626..779c8d3156e5 100644
--- a/drivers/gpu/drm/i915/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/intel_guc_submission.c
@@ -776,6 +776,44 @@ static void guc_submission_tasklet(unsigned long data)
 		guc_dequeue(engine);
 }
 
+static struct i915_request *
+guc_reset_prepare(struct intel_engine_cs *engine)
+{
+	struct intel_engine_execlists * const execlists = &engine->execlists;
+
+	/*
+	 * Prevent request submission to the hardware until we have
+	 * completed the reset in i915_gem_reset_finish(). If a request
+	 * is completed by one engine, it may then queue a request
+	 * to a second via its execlists->tasklet *just* as we are
+	 * calling engine->init_hw() and also writing the ELSP.
+	 * Turning off the execlists->tasklet until the reset is over
+	 * prevents the race.
+	 *
+	 * Note that this needs to be a single atomic operation on the
+	 * tasklet (flush existing tasks, prevent new tasks) to prevent
+	 * a race between reset and set-wedged. It is not, so we do the best
+	 * we can atm and make sure we don't lock the machine up in the more
+	 * common case of recursively being called from set-wedged from inside
+	 * i915_reset.
+	 */
+	if (!atomic_read(&execlists->tasklet.count))
+		tasklet_kill(&execlists->tasklet);
+	tasklet_disable(&execlists->tasklet);
+
+	/*
+	 * We're using a worker to queue preemption requests from the tasklet
+	 * in GuC submission mode.
+	 * Even though the tasklet was disabled, we may still have a worker
+	 * queued. Let's make sure that all workers scheduled before disabling
+	 * the tasklet have completed before continuing with the reset.
+	 */
+	if (engine->i915->guc.preempt_wq)
+		flush_workqueue(engine->i915->guc.preempt_wq);
+
+	return i915_gem_find_active_request(engine);
+}
+
 /*
  * Everything below here is concerned with setup & teardown, and is
  * therefore not part of the somewhat time-critical batch-submission
@@ -1235,6 +1273,7 @@ int intel_guc_submission_enable(struct intel_guc *guc)
 			&engine->execlists;
 
 		execlists->tasklet.func = guc_submission_tasklet;
+		engine->reset.prepare = guc_reset_prepare;
 		engine->park = guc_submission_park;
 		engine->unpark = guc_submission_unpark;
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index f662a9524233..e5a3ffbc273a 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1688,16 +1688,6 @@ execlists_reset_prepare(struct intel_engine_cs *engine)
 		tasklet_kill(&execlists->tasklet);
 	tasklet_disable(&execlists->tasklet);
 
-	/*
-	 * We're using worker to queue preemption requests from the tasklet in
-	 * GuC submission mode.
-	 * Even though tasklet was disabled, we may still have a worker queued.
-	 * Let's make sure that all workers scheduled before disabling the
-	 * tasklet are completed before continuing with the reset.
-	 */
-	if (engine->i915->guc.preempt_wq)
-		flush_workqueue(engine->i915->guc.preempt_wq);
-
 	return i915_gem_find_active_request(engine);
 }
 
@@ -2115,6 +2105,7 @@ static void execlists_set_default_submission(struct intel_engine_cs *engine)
 	engine->cancel_requests = execlists_cancel_requests;
 	engine->schedule = execlists_schedule;
 	engine->execlists.tasklet.func = execlists_submission_tasklet;
+	engine->reset.prepare = execlists_reset_prepare;
 
 	engine->park = NULL;
 	engine->unpark = NULL;
-- 
2.16.2

* [RFC 5/8] drm/i915/execlists: Flush pending preemption events during reset
  2018-03-21 17:26 [RFC 0/8] Force preemption jeff.mcgee
                   ` (3 preceding siblings ...)
  2018-03-21 17:26 ` [RFC 4/8] drm/i915: Split execlists/guc reset preparations jeff.mcgee
@ 2018-03-21 17:26 ` jeff.mcgee
  2018-03-21 17:26 ` [RFC 6/8] drm/i915: Fix loop on CSB processing jeff.mcgee
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: jeff.mcgee @ 2018-03-21 17:26 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Chris Wilson <chris@chris-wilson.co.uk>

Catch up with the in-flight CSB events after disabling the tasklet, but
before deciding which request was truly guilty of hanging the GPU.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michał Winiarski <michal.winiarski@intel.com>
CC: Michel Thierry <michel.thierry@intel.com>
Cc: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 355 ++++++++++++++++++++++-----------------
 1 file changed, 197 insertions(+), 158 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index e5a3ffbc273a..beb81f13a3cc 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -828,186 +828,192 @@ static void execlists_cancel_requests(struct intel_engine_cs *engine)
 	local_irq_restore(flags);
 }
 
-/*
- * Check the unread Context Status Buffers and manage the submission of new
- * contexts to the ELSP accordingly.
- */
-static void execlists_submission_tasklet(unsigned long data)
+static void process_csb(struct intel_engine_cs *engine)
 {
-	struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
 	struct intel_engine_execlists * const execlists = &engine->execlists;
 	struct execlist_port * const port = execlists->port;
-	struct drm_i915_private *dev_priv = engine->i915;
+	struct drm_i915_private *i915 = engine->i915;
+	unsigned int head, tail;
+	const u32 *buf;
 	bool fw = false;
 
-	/* We can skip acquiring intel_runtime_pm_get() here as it was taken
-	 * on our behalf by the request (see i915_gem_mark_busy()) and it will
-	 * not be relinquished until the device is idle (see
-	 * i915_gem_idle_work_handler()). As a precaution, we make sure
-	 * that all ELSP are drained i.e. we have processed the CSB,
-	 * before allowing ourselves to idle and calling intel_runtime_pm_put().
-	 */
-	GEM_BUG_ON(!dev_priv->gt.awake);
+	if (unlikely(execlists->csb_use_mmio)) {
+		buf = (u32 * __force)
+			(i915->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_BUF_LO(engine, 0)));
+		execlists->csb_head = -1; /* force mmio read of CSB ptrs */
+	} else {
+		/* The HWSP contains a (cacheable) mirror of the CSB */
+		buf = &engine->status_page.page_addr[I915_HWS_CSB_BUF0_INDEX];
+	}
 
-	/* Prefer doing test_and_clear_bit() as a two stage operation to avoid
-	 * imposing the cost of a locked atomic transaction when submitting a
-	 * new request (outside of the context-switch interrupt).
+	/*
+	 * The write will be ordered by the uncached read (itself
+	 * a memory barrier), so we do not need another in the form
+	 * of a locked instruction. The race between the interrupt
+	 * handler and the split test/clear is harmless as we order
+	 * our clear before the CSB read. If the interrupt arrived
+	 * first between the test and the clear, we read the updated
+	 * CSB and clear the bit. If the interrupt arrives as we read
+	 * the CSB or later (i.e. after we had cleared the bit) the bit
+	 * is set and we do a new loop.
 	 */
-	while (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted)) {
-		/* The HWSP contains a (cacheable) mirror of the CSB */
-		const u32 *buf =
-			&engine->status_page.page_addr[I915_HWS_CSB_BUF0_INDEX];
-		unsigned int head, tail;
-
-		if (unlikely(execlists->csb_use_mmio)) {
-			buf = (u32 * __force)
-				(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_BUF_LO(engine, 0)));
-			execlists->csb_head = -1; /* force mmio read of CSB ptrs */
-		}
+	__clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
+	if (unlikely(execlists->csb_head == -1)) { /* following a reset */
+		intel_uncore_forcewake_get(i915, execlists->fw_domains);
+		fw = true;
+
+		head = readl(i915->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)));
+		tail = GEN8_CSB_WRITE_PTR(head);
+		head = GEN8_CSB_READ_PTR(head);
+		execlists->csb_head = head;
+	} else {
+		const int write_idx =
+			intel_hws_csb_write_index(i915) -
+			I915_HWS_CSB_BUF0_INDEX;
+
+		head = execlists->csb_head;
+		tail = READ_ONCE(buf[write_idx]);
+	}
+	GEM_TRACE("%s cs-irq head=%d [%d%s], tail=%d [%d%s]\n",
+		  engine->name,
+		  head, GEN8_CSB_READ_PTR(readl(i915->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?",
+		  tail, GEN8_CSB_WRITE_PTR(readl(i915->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?");
+
+	while (head != tail) {
+		struct i915_request *rq;
+		unsigned int status;
+		unsigned int count;
+
+		if (++head == GEN8_CSB_ENTRIES)
+			head = 0;
 
-		/* The write will be ordered by the uncached read (itself
-		 * a memory barrier), so we do not need another in the form
-		 * of a locked instruction. The race between the interrupt
-		 * handler and the split test/clear is harmless as we order
-		 * our clear before the CSB read. If the interrupt arrived
-		 * first between the test and the clear, we read the updated
-		 * CSB and clear the bit. If the interrupt arrives as we read
-		 * the CSB or later (i.e. after we had cleared the bit) the bit
-		 * is set and we do a new loop.
+		/*
+		 * We are flying near dragons again.
+		 *
+		 * We hold a reference to the request in execlist_port[]
+		 * but no more than that. We are operating in softirq
+		 * context and so cannot hold any mutex or sleep. That
+		 * prevents us stopping the requests we are processing
+		 * in port[] from being retired simultaneously (the
+		 * breadcrumb will be complete before we see the
+		 * context-switch). As we only hold the reference to the
+		 * request, any pointer chasing underneath the request
+		 * is subject to a potential use-after-free. Thus we
+		 * store all of the bookkeeping within port[] as
+		 * required, and avoid using unguarded pointers beneath
+		 * request itself. The same applies to the atomic
+		 * status notifier.
 		 */
-		__clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
-		if (unlikely(execlists->csb_head == -1)) { /* following a reset */
-			if (!fw) {
-				intel_uncore_forcewake_get(dev_priv,
-							   execlists->fw_domains);
-				fw = true;
-			}
 
-			head = readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)));
-			tail = GEN8_CSB_WRITE_PTR(head);
-			head = GEN8_CSB_READ_PTR(head);
-			execlists->csb_head = head;
-		} else {
-			const int write_idx =
-				intel_hws_csb_write_index(dev_priv) -
-				I915_HWS_CSB_BUF0_INDEX;
+		status = READ_ONCE(buf[2 * head]); /* maybe mmio! */
+		GEM_TRACE("%s csb[%d]: status=0x%08x:0x%08x, active=0x%x\n",
+			  engine->name, head,
+			  status, buf[2*head + 1],
+			  execlists->active);
+
+		if (status & (GEN8_CTX_STATUS_IDLE_ACTIVE |
+			      GEN8_CTX_STATUS_PREEMPTED))
+			execlists_set_active(execlists,
+					     EXECLISTS_ACTIVE_HWACK);
+		if (status & GEN8_CTX_STATUS_ACTIVE_IDLE)
+			execlists_clear_active(execlists,
+					       EXECLISTS_ACTIVE_HWACK);
+
+		if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK))
+			continue;
 
-			head = execlists->csb_head;
-			tail = READ_ONCE(buf[write_idx]);
-		}
-		GEM_TRACE("%s cs-irq head=%d [%d%s], tail=%d [%d%s]\n",
-			  engine->name,
-			  head, GEN8_CSB_READ_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?",
-			  tail, GEN8_CSB_WRITE_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?");
+		/* We should never get a COMPLETED | IDLE_ACTIVE! */
+		GEM_BUG_ON(status & GEN8_CTX_STATUS_IDLE_ACTIVE);
 
-		while (head != tail) {
-			struct i915_request *rq;
-			unsigned int status;
-			unsigned int count;
+		if (status & GEN8_CTX_STATUS_COMPLETE &&
+		    buf[2*head + 1] == execlists->preempt_complete_status) {
+			GEM_TRACE("%s preempt-idle\n", engine->name);
+			complete_preempt_context(execlists);
+			continue;
+		}
 
-			if (++head == GEN8_CSB_ENTRIES)
-				head = 0;
+		if (status & GEN8_CTX_STATUS_PREEMPTED &&
+		    execlists_is_active(execlists,
+					EXECLISTS_ACTIVE_PREEMPT))
+			continue;
 
-			/* We are flying near dragons again.
-			 *
-			 * We hold a reference to the request in execlist_port[]
-			 * but no more than that. We are operating in softirq
-			 * context and so cannot hold any mutex or sleep. That
-			 * prevents us stopping the requests we are processing
-			 * in port[] from being retired simultaneously (the
-			 * breadcrumb will be complete before we see the
-			 * context-switch). As we only hold the reference to the
-			 * request, any pointer chasing underneath the request
-			 * is subject to a potential use-after-free. Thus we
-			 * store all of the bookkeeping within port[] as
-			 * required, and avoid using unguarded pointers beneath
-			 * request itself. The same applies to the atomic
-			 * status notifier.
-			 */
+		GEM_BUG_ON(!execlists_is_active(execlists,
+						EXECLISTS_ACTIVE_USER));
 
-			status = READ_ONCE(buf[2 * head]); /* maybe mmio! */
-			GEM_TRACE("%s csb[%d]: status=0x%08x:0x%08x, active=0x%x\n",
-				  engine->name, head,
-				  status, buf[2*head + 1],
-				  execlists->active);
-
-			if (status & (GEN8_CTX_STATUS_IDLE_ACTIVE |
-				      GEN8_CTX_STATUS_PREEMPTED))
-				execlists_set_active(execlists,
-						     EXECLISTS_ACTIVE_HWACK);
-			if (status & GEN8_CTX_STATUS_ACTIVE_IDLE)
-				execlists_clear_active(execlists,
-						       EXECLISTS_ACTIVE_HWACK);
-
-			if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK))
-				continue;
+		rq = port_unpack(port, &count);
+		GEM_TRACE("%s out[0]: ctx=%d.%d, seqno=%x, prio=%d\n",
+			  engine->name,
+			  port->context_id, count,
+			  rq ? rq->global_seqno : 0,
+			  rq ? rq_prio(rq) : 0);
+
+		/* Check the context/desc id for this event matches */
+		GEM_DEBUG_BUG_ON(buf[2 * head + 1] != port->context_id);
+
+		GEM_BUG_ON(count == 0);
+		if (--count == 0) {
+			GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
+			GEM_BUG_ON(port_isset(&port[1]) &&
+				   !(status & GEN8_CTX_STATUS_ELEMENT_SWITCH));
+			GEM_BUG_ON(!i915_request_completed(rq));
+			execlists_context_schedule_out(rq);
+			trace_i915_request_out(rq);
+			i915_request_put(rq);
+
+			GEM_TRACE("%s completed ctx=%d\n",
+				  engine->name, port->context_id);
+
+			execlists_port_complete(execlists, port);
+		} else {
+			port_set(port, port_pack(rq, count));
+		}
 
-			/* We should never get a COMPLETED | IDLE_ACTIVE! */
-			GEM_BUG_ON(status & GEN8_CTX_STATUS_IDLE_ACTIVE);
+		/* After the final element, the hw should be idle */
+		GEM_BUG_ON(port_count(port) == 0 &&
+			   !(status & GEN8_CTX_STATUS_ACTIVE_IDLE));
+		if (port_count(port) == 0)
+			execlists_clear_active(execlists,
+					       EXECLISTS_ACTIVE_USER);
+	}
 
-			if (status & GEN8_CTX_STATUS_COMPLETE &&
-			    buf[2*head + 1] == execlists->preempt_complete_status) {
-				GEM_TRACE("%s preempt-idle\n", engine->name);
-				complete_preempt_context(execlists);
-				continue;
-			}
+	if (head != execlists->csb_head) {
+		execlists->csb_head = head;
+		writel(_MASKED_FIELD(GEN8_CSB_READ_PTR_MASK, head << 8),
+		       i915->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)));
+	}
 
-			if (status & GEN8_CTX_STATUS_PREEMPTED &&
-			    execlists_is_active(execlists,
-						EXECLISTS_ACTIVE_PREEMPT))
-				continue;
+	if (unlikely(fw))
+		intel_uncore_forcewake_put(i915, execlists->fw_domains);
+}
 
-			GEM_BUG_ON(!execlists_is_active(execlists,
-							EXECLISTS_ACTIVE_USER));
-
-			rq = port_unpack(port, &count);
-			GEM_TRACE("%s out[0]: ctx=%d.%d, seqno=%x, prio=%d\n",
-				  engine->name,
-				  port->context_id, count,
-				  rq ? rq->global_seqno : 0,
-				  rq ? rq_prio(rq) : 0);
-
-			/* Check the context/desc id for this event matches */
-			GEM_DEBUG_BUG_ON(buf[2 * head + 1] != port->context_id);
-
-			GEM_BUG_ON(count == 0);
-			if (--count == 0) {
-				GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
-				GEM_BUG_ON(port_isset(&port[1]) &&
-					   !(status & GEN8_CTX_STATUS_ELEMENT_SWITCH));
-				GEM_BUG_ON(!i915_request_completed(rq));
-				execlists_context_schedule_out(rq);
-				trace_i915_request_out(rq);
-				i915_request_put(rq);
-
-				GEM_TRACE("%s completed ctx=%d\n",
-					  engine->name, port->context_id);
-
-				execlists_port_complete(execlists, port);
-			} else {
-				port_set(port, port_pack(rq, count));
-			}
+/*
+ * Check the unread Context Status Buffers and manage the submission of new
+ * contexts to the ELSP accordingly.
+ */
+static void execlists_submission_tasklet(unsigned long data)
+{
+	struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
 
-			/* After the final element, the hw should be idle */
-			GEM_BUG_ON(port_count(port) == 0 &&
-				   !(status & GEN8_CTX_STATUS_ACTIVE_IDLE));
-			if (port_count(port) == 0)
-				execlists_clear_active(execlists,
-						       EXECLISTS_ACTIVE_USER);
-		}
+	/*
+	 * We can skip acquiring intel_runtime_pm_get() here as it was taken
+	 * on our behalf by the request (see i915_gem_mark_busy()) and it will
+	 * not be relinquished until the device is idle (see
+	 * i915_gem_idle_work_handler()). As a precaution, we make sure
+	 * that all ELSP are drained i.e. we have processed the CSB,
+	 * before allowing ourselves to idle and calling intel_runtime_pm_put().
+	 */
+	GEM_BUG_ON(!engine->i915->gt.awake);
 
-		if (head != execlists->csb_head) {
-			execlists->csb_head = head;
-			writel(_MASKED_FIELD(GEN8_CSB_READ_PTR_MASK, head << 8),
-			       dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)));
-		}
-	}
+	/*
+	 * Prefer doing test_and_clear_bit() as a two stage operation to avoid
+	 * imposing the cost of a locked atomic transaction when submitting a
+	 * new request (outside of the context-switch interrupt).
+	 */
+	if (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
+		process_csb(engine);
 
-	if (!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT))
+	if (!execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
 		execlists_dequeue(engine);
-
-	if (fw)
-		intel_uncore_forcewake_put(dev_priv, execlists->fw_domains);
 }
 
 static void queue_request(struct intel_engine_cs *engine,
@@ -1667,6 +1673,7 @@ static struct i915_request *
 execlists_reset_prepare(struct intel_engine_cs *engine)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
+	struct i915_request *request, *active;
 
 	/*
 	 * Prevent request submission to the hardware until we have
@@ -1688,7 +1695,39 @@ execlists_reset_prepare(struct intel_engine_cs *engine)
 		tasklet_kill(&execlists->tasklet);
 	tasklet_disable(&execlists->tasklet);
 
-	return i915_gem_find_active_request(engine);
+	/*
+	 * We want to flush the pending context switches: having disabled
+	 * the tasklet above, we can assume exclusive access to the execlists.
+	 * This allows us to catch up with an in-flight preemption event,
+	 * and avoid blaming an innocent request if the stall was due to the
+	 * preemption itself.
+	 */
+	process_csb(engine);
+
+	/*
+	 * The last active request can then be no later than the last request
+	 * now in ELSP[0]. So search backwards from there, so that if the GPU
+	 * has advanced beyond the last CSB update, it will be pardoned.
+	 */
+	active = NULL;
+	request = port_request(execlists->port);
+	if (request) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&engine->timeline->lock, flags);
+		list_for_each_entry_from_reverse(request,
+						 &engine->timeline->requests,
+						 link) {
+			if (__i915_request_completed(request,
+						     request->global_seqno))
+				break;
+
+			active = request;
+		}
+		spin_unlock_irqrestore(&engine->timeline->lock, flags);
+	}
+
+	return active;
 }
 
 static void reset_irq(struct intel_engine_cs *engine)
-- 
2.16.2

* [RFC 6/8] drm/i915: Fix loop on CSB processing
  2018-03-21 17:26 [RFC 0/8] Force preemption jeff.mcgee
                   ` (4 preceding siblings ...)
  2018-03-21 17:26 ` [RFC 5/8] drm/i915/execlists: Flush pending preemption events during reset jeff.mcgee
@ 2018-03-21 17:26 ` jeff.mcgee
  2018-03-21 17:33   ` Jeff McGee
  2018-03-21 17:26 ` [RFC 7/8] drm/i915: Skip CSB processing on invalid CSB tail jeff.mcgee
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: jeff.mcgee @ 2018-03-21 17:26 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

Restore the loop over CSB processing that was dropped when the handler
was extracted into process_csb(): keep processing while the
context-switch interrupt bit is re-posted, so that events arriving
while we read the buffer are handled before we attempt to dequeue.

Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index beb81f13a3cc..cec4e1653daf 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1009,7 +1009,7 @@ static void execlists_submission_tasklet(unsigned long data)
 	 * imposing the cost of a locked atomic transaction when submitting a
 	 * new request (outside of the context-switch interrupt).
 	 */
-	if (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
+	while (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
 		process_csb(engine);
 
 	if (!execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
-- 
2.16.2

* [RFC 7/8] drm/i915: Skip CSB processing on invalid CSB tail
  2018-03-21 17:26 [RFC 0/8] Force preemption jeff.mcgee
                   ` (5 preceding siblings ...)
  2018-03-21 17:26 ` [RFC 6/8] drm/i915: Fix loop on CSB processing jeff.mcgee
@ 2018-03-21 17:26 ` jeff.mcgee
  2018-03-21 17:31   ` Jeff McGee
  2018-03-21 17:26 ` [RFC 8/8] drm/i915: Force preemption to complete via engine reset jeff.mcgee
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: jeff.mcgee @ 2018-03-21 17:26 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

Engine reset is fast. A context switch interrupt may be generated just
prior to the reset such that the top half handler is racing with reset
post-processing. The handler may set the irq_posted bit again after
the reset code has cleared it to start fresh. Then the re-enabled
tasklet will read the CSB head and tail from MMIO, which will be at
the hardware reset values of 0 and 7 respectively, given that no
actual CSB event has occurred since the reset. Mayhem then ensues as
the tasklet starts processing invalid CSB entries.

We can handle this corner case without adding any new synchronization
between the irq top half and the reset work item. The tasklet can
just skip CSB processing if the tail is not sane.

Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index cec4e1653daf..d420c2ecb50a 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -865,6 +865,14 @@ static void process_csb(struct intel_engine_cs *engine)
 		head = readl(i915->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)));
 		tail = GEN8_CSB_WRITE_PTR(head);
 		head = GEN8_CSB_READ_PTR(head);
+
+		/* The CSB tail read from MMIO may still be at the
+		 * hardware reset value of 0x7 if there hasn't been a
+		 * valid CSB event since the engine reset.
+		 */
+		if (tail >= GEN8_CSB_ENTRIES)
+			goto out;
+
 		execlists->csb_head = head;
 	} else {
 		const int write_idx =
@@ -873,6 +881,7 @@ static void process_csb(struct intel_engine_cs *engine)
 
 		head = execlists->csb_head;
 		tail = READ_ONCE(buf[write_idx]);
+		GEM_BUG_ON(tail >= GEN8_CSB_ENTRIES);
 	}
 	GEM_TRACE("%s cs-irq head=%d [%d%s], tail=%d [%d%s]\n",
 		  engine->name,
@@ -981,7 +990,7 @@ static void process_csb(struct intel_engine_cs *engine)
 		writel(_MASKED_FIELD(GEN8_CSB_READ_PTR_MASK, head << 8),
 		       i915->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)));
 	}
-
+out:
 	if (unlikely(fw))
 		intel_uncore_forcewake_put(i915, execlists->fw_domains);
 }
-- 
2.16.2

* [RFC 8/8] drm/i915: Force preemption to complete via engine reset
  2018-03-21 17:26 [RFC 0/8] Force preemption jeff.mcgee
                   ` (6 preceding siblings ...)
  2018-03-21 17:26 ` [RFC 7/8] drm/i915: Skip CSB processing on invalid CSB tail jeff.mcgee
@ 2018-03-21 17:26 ` jeff.mcgee
  2018-03-21 18:50 ` ✗ Fi.CI.BAT: failure for Force preemption (rev2) Patchwork
  2018-03-22  9:22 ` [RFC 0/8] Force preemption Tvrtko Ursulin
  9 siblings, 0 replies; 29+ messages in thread
From: jeff.mcgee @ 2018-03-21 17:26 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

The hardware can complete the requested preemption at only certain points
in execution. Thus an uncooperative request that avoids those points can
block a preemption indefinitely. Our only option to bound the preemption
latency is to trigger reset and recovery just as we would if a request had
hung the hardware. This is so-called forced preemption. This change adds
that capability as an option for systems with extremely strict scheduling
latency requirements for their high-priority requests. This option must be
used with great care. The force-preempted requests will be discarded at
the point of reset, resulting in various degrees of disruption to the
owning application, up to and including a crash.

The option is enabled by specifying a non-zero value for the new i915 module
parameter fpreempt_timeout. This value becomes the time in milliseconds
after initiation of preemption at which the reset is triggered if the
preemption has not completed normally.

GuC mode is not supported. The fpreempt_timeout parameter is simply
ignored.

Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c         | 27 ++++++++++++++++++++--
 drivers/gpu/drm/i915/i915_params.c      |  3 +++
 drivers/gpu/drm/i915/i915_params.h      |  1 +
 drivers/gpu/drm/i915/intel_engine_cs.c  | 40 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.c        | 13 +++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |  4 ++++
 6 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 38f7160d99c9..0ad448792bfb 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2896,8 +2896,24 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
 	return active;
 }
 
-static bool engine_stalled(struct intel_engine_cs *engine)
+static bool engine_stalled(struct intel_engine_cs *engine,
+			   struct i915_request *request)
 {
+	if (engine->fpreempt_stalled) {
+		/* Pardon the request if it managed to yield the
+		 * engine by completing just prior to the reset. We
+		 * could be even more sophisticated here and pardon
+		 * the request if it preempted out (mid-batch) prior
+		 * to the reset, but that's not so straightforward
+		 * to detect. Perhaps not worth splitting hairs when
+		 * the request had clearly behaved badly to get here.
+		 */
+		if (i915_request_completed(request))
+			return false;
+
+		return true;
+	}
+
 	if (!engine->hangcheck.stalled)
 		return false;
 
@@ -3038,7 +3054,7 @@ i915_gem_reset_request(struct intel_engine_cs *engine,
 	 * subsequent hangs.
 	 */
 
-	if (engine_stalled(engine)) {
+	if (engine_stalled(engine, request)) {
 		i915_gem_context_mark_guilty(request->ctx);
 		skip_request(request);
 
@@ -3046,6 +3062,13 @@ i915_gem_reset_request(struct intel_engine_cs *engine,
 		if (i915_gem_context_is_banned(request->ctx))
 			engine_skip_context(request);
 	} else {
+		/* If the request that we just pardoned was the target of a
+		 * force preemption there is no possibility of the next
+		 * request in line having started.
+		 */
+		if (engine->fpreempt_stalled)
+			return NULL;
+
 		/*
 		 * Since this is not the hung engine, it may have advanced
 		 * since the hang declaration. Double check by refinding
diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 08108ce5be21..71bc8213acb2 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -178,6 +178,9 @@ i915_param_named(enable_dpcd_backlight, bool, 0600,
 i915_param_named(enable_gvt, bool, 0400,
 	"Enable support for Intel GVT-g graphics virtualization host support(default:false)");
 
+i915_param_named(fpreempt_timeout, uint, 0600,
+	"Wait time in msecs before forcing a preemption with reset (0:never force [default])");
+
 static __always_inline void _print_param(struct drm_printer *p,
 					 const char *name,
 					 const char *type,
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index c96360398072..1d50f223b637 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -55,6 +55,7 @@ struct drm_printer;
 	param(int, edp_vswing, 0) \
 	param(int, reset, 2) \
 	param(unsigned int, inject_load_failure, 0) \
+	param(unsigned int, fpreempt_timeout, 0) \
 	/* leave bools at the end to not create holes */ \
 	param(bool, alpha_support, IS_ENABLED(CONFIG_DRM_I915_ALPHA_SUPPORT)) \
 	param(bool, enable_cmd_parser, true) \
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 337dfa56a738..81358afd1957 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -493,6 +493,45 @@ static void intel_engine_init_execlist(struct intel_engine_cs *engine)
 	execlists->first = NULL;
 }
 
+static enum hrtimer_restart
+intel_engine_fpreempt_timer(struct hrtimer *hrtimer)
+{
+	struct intel_engine_cs *engine =
+		container_of(hrtimer, struct intel_engine_cs,
+			     fpreempt_timer);
+
+	if (execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
+		queue_work(system_highpri_wq, &engine->fpreempt_work);
+
+	return HRTIMER_NORESTART;
+}
+
+static void intel_engine_fpreempt_work(struct work_struct *work)
+{
+	struct intel_engine_cs *engine =
+		container_of(work, struct intel_engine_cs,
+			     fpreempt_work);
+
+	tasklet_kill(&engine->execlists.tasklet);
+	tasklet_disable(&engine->execlists.tasklet);
+
+	if (execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT)) {
+		engine->fpreempt_stalled = true;
+		i915_handle_error(engine->i915, intel_engine_flag(engine),
+				  0, "force preemption");
+	}
+
+	tasklet_enable(&engine->execlists.tasklet);
+}
+
+static void intel_engine_init_fpreempt(struct intel_engine_cs *engine)
+{
+	hrtimer_init(&engine->fpreempt_timer,
+		     CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	engine->fpreempt_timer.function = intel_engine_fpreempt_timer;
+	INIT_WORK(&engine->fpreempt_work, intel_engine_fpreempt_work);
+}
+
 /**
  * intel_engines_setup_common - setup engine state not requiring hw access
  * @engine: Engine to setup.
@@ -508,6 +547,7 @@ void intel_engine_setup_common(struct intel_engine_cs *engine)
 	intel_engine_init_timeline(engine);
 	intel_engine_init_hangcheck(engine);
 	intel_engine_init_batch_pool(engine);
+	intel_engine_init_fpreempt(engine);
 	intel_engine_init_cmd_parser(engine);
 }
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index d420c2ecb50a..0de687856434 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -533,15 +533,25 @@ static void inject_preempt_context(struct intel_engine_cs *engine)
 
 	execlists_clear_active(execlists, EXECLISTS_ACTIVE_HWACK);
 	execlists_set_active(execlists, EXECLISTS_ACTIVE_PREEMPT);
+
+	if (i915_modparams.fpreempt_timeout)
+		hrtimer_start(&engine->fpreempt_timer,
+			      ms_to_ktime(i915_modparams.fpreempt_timeout),
+			      HRTIMER_MODE_REL);
 }
 
 static void complete_preempt_context(struct intel_engine_execlists *execlists)
 {
+	struct intel_engine_cs *engine =
+		container_of(execlists, struct intel_engine_cs, execlists);
+
 	execlists_cancel_port_requests(execlists);
 	execlists_unwind_incomplete_requests(execlists);
 
 	GEM_BUG_ON(!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT));
 	execlists_clear_active(execlists, EXECLISTS_ACTIVE_PREEMPT);
+
+	hrtimer_try_to_cancel(&engine->fpreempt_timer);
 }
 
 static void execlists_dequeue(struct intel_engine_cs *engine)
@@ -1844,6 +1854,9 @@ static void execlists_reset(struct intel_engine_cs *engine,
 
 static void execlists_reset_finish(struct intel_engine_cs *engine)
 {
+	/* Mark any force preemption as resolved */
+	engine->fpreempt_stalled = false;
+
 	tasklet_enable(&engine->execlists.tasklet);
 }
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index e2681303ce21..51286796def7 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -565,6 +565,10 @@ struct intel_engine_cs {
 
 	struct intel_engine_hangcheck hangcheck;
 
+	struct hrtimer fpreempt_timer;
+	struct work_struct fpreempt_work;
+	bool fpreempt_stalled;
+
 #define I915_ENGINE_NEEDS_CMD_PARSER BIT(0)
 #define I915_ENGINE_SUPPORTS_STATS   BIT(1)
 	unsigned int flags;
-- 
2.16.2
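The hrtimer-plus-worker watchdog added by the patch above can be modelled as a
small state machine: arm a timer when the preemption request is injected,
cancel it when the hardware completes the preemption, and escalate to an
engine reset if it fires first. The sketch below is userspace C with
hypothetical names; it models only the state transitions, not the kernel
hrtimer/workqueue plumbing.

```c
#include <assert.h>
#include <stdbool.h>

enum engine_state { IDLE, PREEMPT_PENDING, RESET_IN_PROGRESS };

struct engine {
	enum engine_state state;
	bool timer_armed;	/* models the fpreempt hrtimer */
	unsigned int resets;	/* counts forced resets */
};

/* inject_preempt_context(): request preemption and arm the watchdog */
static void inject_preempt(struct engine *e, unsigned int timeout_ms)
{
	e->state = PREEMPT_PENDING;
	if (timeout_ms)		/* 0 means "never force", as in the modparam */
		e->timer_armed = true;
}

/* complete_preempt_context(): preemption finished in time, cancel timer */
static void complete_preempt(struct engine *e)
{
	e->state = IDLE;
	e->timer_armed = false;
}

/* Timer expiry: escalate to reset only if preemption is still pending. */
static bool timer_fire(struct engine *e)
{
	e->timer_armed = false;
	if (e->state != PREEMPT_PENDING)
		return false;	/* raced with completion: nothing to do */
	e->state = RESET_IN_PROGRESS;
	e->resets++;
	return true;		/* caller would invoke i915_handle_error() */
}
```

Note the same late-check pattern as the real work item: the expiry path
re-tests for a still-pending preemption before escalating, so a completion
that races with the timer does not trigger a spurious reset.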

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC 7/8] drm/i915: Skip CSB processing on invalid CSB tail
  2018-03-21 17:26 ` [RFC 7/8] drm/i915: Skip CSB processing on invalid CSB tail jeff.mcgee
@ 2018-03-21 17:31   ` Jeff McGee
  2018-03-21 18:12     ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff McGee @ 2018-03-21 17:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: kalyan.kondapally, ben

On Wed, Mar 21, 2018 at 10:26:24AM -0700, jeff.mcgee@intel.com wrote:
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> Engine reset is fast. A context switch interrupt may be generated just
> prior to the reset such that the top half handler is racing with reset
> post-processing. The handler may set the irq_posted bit again after
> the reset code has cleared it to start fresh. Then the re-enabled
> tasklet will read the CSB head and tail from MMIO, which will be at
> the hardware reset values of 0 and 7 respectively, given that no
> actual CSB event has occurred since the reset. Mayhem then ensues as
> the tasklet starts processing invalid CSB entries.
> 
> We can handle this corner case without adding any new synchronization
> between the irq top half and the reset work item. The tasklet can
> just skip CSB processing if the tail is not sane.
> 
> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> ---
If I drop this patch and substitute https://patchwork.freedesktop.org/patch/211831/
I will see irq_posted get set after reset which causes the first tasklet
run to re-process a previous CSB event and hit GEM_BUG_ON that nothing
was active.
-Jeff
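The guard the commit message describes can be sketched as a simple range
check (hypothetical names): after an engine reset the CSB pointer register
reads back its hardware default of head 0 and tail 7, but the ring has only
six entries, so any tail outside 0..5 cannot be a real event and the tasklet
can skip processing.

```c
#include <assert.h>
#include <stdbool.h>

#define GEN8_CSB_ENTRIES 6	/* context-status buffer slots per engine */

/* A tail read back from MMIO is only meaningful if it indexes the ring. */
static bool csb_tail_valid(unsigned int tail)
{
	return tail < GEN8_CSB_ENTRIES;
}

/* Decide whether a CSB snapshot is worth processing at all. */
static bool should_process_csb(unsigned int head, unsigned int tail)
{
	(void)head;		/* head is trusted once the tail is sane */
	return csb_tail_valid(tail);
}
```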

* Re: [RFC 6/8] drm/i915: Fix loop on CSB processing
  2018-03-21 17:26 ` [RFC 6/8] drm/i915: Fix loop on CSB processing jeff.mcgee
@ 2018-03-21 17:33   ` Jeff McGee
  2018-03-21 18:06     ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff McGee @ 2018-03-21 17:33 UTC (permalink / raw)
  To: intel-gfx; +Cc: kalyan.kondapally, ben

On Wed, Mar 21, 2018 at 10:26:23AM -0700, jeff.mcgee@intel.com wrote:
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index beb81f13a3cc..cec4e1653daf 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1009,7 +1009,7 @@ static void execlists_submission_tasklet(unsigned long data)
>  	 * imposing the cost of a locked atomic transaction when submitting a
>  	 * new request (outside of the context-switch interrupt).
>  	 */
> -	if (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
> +	while (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
Assuming that this accidentally went missing in the refactor. Chris?

>  		process_csb(engine);
>  
>  	if (!execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
> -- 
> 2.16.2
> 

* Re: [RFC 6/8] drm/i915: Fix loop on CSB processing
  2018-03-21 17:33   ` Jeff McGee
@ 2018-03-21 18:06     ` Chris Wilson
  2018-03-21 18:29       ` Jeff McGee
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2018-03-21 18:06 UTC (permalink / raw)
  To: Jeff McGee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting Jeff McGee (2018-03-21 17:33:04)
> On Wed, Mar 21, 2018 at 10:26:23AM -0700, jeff.mcgee@intel.com wrote:
> > From: Jeff McGee <jeff.mcgee@intel.com>
> > 
> > Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> > ---
> >  drivers/gpu/drm/i915/intel_lrc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index beb81f13a3cc..cec4e1653daf 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -1009,7 +1009,7 @@ static void execlists_submission_tasklet(unsigned long data)
> >        * imposing the cost of a locked atomic transaction when submitting a
> >        * new request (outside of the context-switch interrupt).
> >        */
> > -     if (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
> > +     while (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
> Assuming that this accidentally went missing in the refactor. Chris?

No. process_csb became a do{} while. The caller did a test_bit to avoid
the function call for normal rescheduling paths.
-Chris

* Re: [RFC 7/8] drm/i915: Skip CSB processing on invalid CSB tail
  2018-03-21 17:31   ` Jeff McGee
@ 2018-03-21 18:12     ` Chris Wilson
  2018-03-21 19:06       ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2018-03-21 18:12 UTC (permalink / raw)
  To: Jeff McGee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting Jeff McGee (2018-03-21 17:31:45)
> On Wed, Mar 21, 2018 at 10:26:24AM -0700, jeff.mcgee@intel.com wrote:
> > From: Jeff McGee <jeff.mcgee@intel.com>
> > 
> > Engine reset is fast. A context switch interrupt may be generated just
> > prior to the reset such that the top half handler is racing with reset
> > post-processing. The handler may set the irq_posted bit again after
> > the reset code has cleared it to start fresh. Then the re-enabled
> > tasklet will read the CSB head and tail from MMIO, which will be at
> > the hardware reset values of 0 and 7 respectively, given that no
> > actual CSB event has occurred since the reset. Mayhem then ensues as
> > the tasklet starts processing invalid CSB entries.
> > 
> > We can handle this corner case without adding any new synchronization
> > between the irq top half and the reset work item. The tasklet can
> > just skip CSB processing if the tail is not sane.
> > 
> > Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> > ---
> If I drop this patch and substitute https://patchwork.freedesktop.org/patch/211831/
> I will see irq_posted get set after reset which causes the first tasklet
> run to re-process a previous CSB event and hit GEM_BUG_ON that nothing
> was active.

However, for reset+sync to be followed by an interrupt is surprising.
What more do we need to do after the reset to flush the last interrupt?

What I've been trying to get right is disabling the RING_IMR across the
reset so that we do not get any new interrupts generated for this engine.
(So couple that also with a sync_irq.) We've been kicking around that
plan for yonks, since the beginning of doing per-engine reset.
-Chris

* Re: [RFC 6/8] drm/i915: Fix loop on CSB processing
  2018-03-21 18:06     ` Chris Wilson
@ 2018-03-21 18:29       ` Jeff McGee
  2018-03-21 19:04         ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff McGee @ 2018-03-21 18:29 UTC (permalink / raw)
  To: Chris Wilson; +Cc: kalyan.kondapally, intel-gfx, ben

On Wed, Mar 21, 2018 at 06:06:44PM +0000, Chris Wilson wrote:
> Quoting Jeff McGee (2018-03-21 17:33:04)
> > On Wed, Mar 21, 2018 at 10:26:23AM -0700, jeff.mcgee@intel.com wrote:
> > > From: Jeff McGee <jeff.mcgee@intel.com>
> > > 
> > > Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/intel_lrc.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > > index beb81f13a3cc..cec4e1653daf 100644
> > > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > > @@ -1009,7 +1009,7 @@ static void execlists_submission_tasklet(unsigned long data)
> > >        * imposing the cost of a locked atomic transaction when submitting a
> > >        * new request (outside of the context-switch interrupt).
> > >        */
> > > -     if (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
> > > +     while (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
> > Assuming that this accidentally went missing in the refactor. Chris?
> 
> No. process_csb became a do{} while. The caller did a test_bit to avoid
> the function call for normal rescheduling paths.
> -Chris

But there is no loop in process_csb().
-Jeff

* ✗ Fi.CI.BAT: failure for Force preemption (rev2)
  2018-03-21 17:26 [RFC 0/8] Force preemption jeff.mcgee
                   ` (7 preceding siblings ...)
  2018-03-21 17:26 ` [RFC 8/8] drm/i915: Force preemption to complete via engine reset jeff.mcgee
@ 2018-03-21 18:50 ` Patchwork
  2018-03-22  9:22 ` [RFC 0/8] Force preemption Tvrtko Ursulin
  9 siblings, 0 replies; 29+ messages in thread
From: Patchwork @ 2018-03-21 18:50 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: Force preemption (rev2)
URL   : https://patchwork.freedesktop.org/series/40120/
State : failure

== Summary ==

Applying: drm/i915/execlists: Refactor out complete_preempt_context()
Applying: drm/i915: Add control flags to i915_handle_error()
error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	drivers/gpu/drm/i915/i915_debugfs.c
M	drivers/gpu/drm/i915/i915_drv.c
M	drivers/gpu/drm/i915/i915_drv.h
M	drivers/gpu/drm/i915/i915_gpu_error.h
M	drivers/gpu/drm/i915/i915_irq.c
M	drivers/gpu/drm/i915/i915_request.c
M	drivers/gpu/drm/i915/intel_hangcheck.c
M	drivers/gpu/drm/i915/selftests/intel_hangcheck.c
Falling back to patching base and 3-way merge...
Auto-merging drivers/gpu/drm/i915/intel_hangcheck.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_hangcheck.c
Patch failed at 0002 drm/i915: Add control flags to i915_handle_error()
The copy of the patch that failed is found in: .git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


* Re: [RFC 6/8] drm/i915: Fix loop on CSB processing
  2018-03-21 18:29       ` Jeff McGee
@ 2018-03-21 19:04         ` Chris Wilson
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2018-03-21 19:04 UTC (permalink / raw)
  To: Jeff McGee; +Cc: kalyan.kondapally, intel-gfx, ben

Quoting Jeff McGee (2018-03-21 18:29:46)
> On Wed, Mar 21, 2018 at 06:06:44PM +0000, Chris Wilson wrote:
> > Quoting Jeff McGee (2018-03-21 17:33:04)
> > > On Wed, Mar 21, 2018 at 10:26:23AM -0700, jeff.mcgee@intel.com wrote:
> > > > From: Jeff McGee <jeff.mcgee@intel.com>
> > > > 
> > > > Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/intel_lrc.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > > > index beb81f13a3cc..cec4e1653daf 100644
> > > > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > > > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > > > @@ -1009,7 +1009,7 @@ static void execlists_submission_tasklet(unsigned long data)
> > > >        * imposing the cost of a locked atomic transaction when submitting a
> > > >        * new request (outside of the context-switch interrupt).
> > > >        */
> > > > -     if (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
> > > > +     while (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
> > > Assuming that this accidentally went missing in the refactor. Chris?
> > 
> > No. process_csb became a do{} while. The caller did a test_bit to avoid
> > the function call for normal rescheduling paths.
> > -Chris
> 
> But there is no loop in process_csb().

There is in my copy. I was trying different things and removing the loop
masked a different issue with tasklet scheduling. Strictly we don't need
a loop here as we will always be re-run following an interrupt. But since
we may need to grab forcewake and whatnot, it seems prudent to keep the
loop around.
-Chris
> 
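The test_bit()/process_csb() pattern being discussed can be modelled in
userspace C (hypothetical names, a plain flag standing in for the atomic
bit). Clearing irq_posted *before* draining events means a racing interrupt
re-sets the flag and the while loop picks the new events up, so nothing is
lost between the test and the drain.

```c
#include <assert.h>
#include <stdbool.h>

struct engine {
	bool irq_posted;	/* set by the interrupt top half */
	int pending;		/* events waiting in the (modelled) CSB */
	int processed;		/* events the tasklet has consumed */
};

/* Drain all currently visible events; clear the flag first, then read. */
static void process_csb(struct engine *e)
{
	e->irq_posted = false;
	e->processed += e->pending;
	e->pending = 0;
}

/* The tasklet body: loop while new interrupts keep arriving. */
static void submission_tasklet(struct engine *e)
{
	while (e->irq_posted)
		process_csb(e);
}
```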

* Re: [RFC 7/8] drm/i915: Skip CSB processing on invalid CSB tail
  2018-03-21 18:12     ` Chris Wilson
@ 2018-03-21 19:06       ` Chris Wilson
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2018-03-21 19:06 UTC (permalink / raw)
  To: Jeff McGee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting Chris Wilson (2018-03-21 18:12:51)
> Quoting Jeff McGee (2018-03-21 17:31:45)
> > On Wed, Mar 21, 2018 at 10:26:24AM -0700, jeff.mcgee@intel.com wrote:
> > > From: Jeff McGee <jeff.mcgee@intel.com>
> > > 
> > > Engine reset is fast. A context switch interrupt may be generated just
> > > prior to the reset such that the top half handler is racing with reset
> > > post-processing. The handler may set the irq_posted bit again after
> > > the reset code has cleared it to start fresh. Then the re-enabled
> > > tasklet will read the CSB head and tail from MMIO, which will be at
> > > the hardware reset values of 0 and 7 respectively, given that no
> > > actual CSB event has occurred since the reset. Mayhem then ensues as
> > > the tasklet starts processing invalid CSB entries.
> > > 
> > > We can handle this corner case without adding any new synchronization
> > > between the irq top half and the reset work item. The tasklet can
> > > just skip CSB processing if the tail is not sane.
> > > 
> > > Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> > > ---
> > If I drop this patch and substitute https://patchwork.freedesktop.org/patch/211831/
> > I will see irq_posted get set after reset which causes the first tasklet
> > run to re-process a previous CSB event and hit GEM_BUG_ON that nothing
> > was active.
> 
> However, for reset+sync to be followed by an interrupt is surprising.
> What more do we need to do after the reset to flush the last interrupt?

Actually, it may not be a late interrupt, just a late cacheline flush
from one processor to another. __set_bit bites.
-Chris

* Re: [RFC 0/8] Force preemption
  2018-03-21 17:26 [RFC 0/8] Force preemption jeff.mcgee
                   ` (8 preceding siblings ...)
  2018-03-21 18:50 ` ✗ Fi.CI.BAT: failure for Force preemption (rev2) Patchwork
@ 2018-03-22  9:22 ` Tvrtko Ursulin
  2018-03-22  9:28   ` Chris Wilson
  9 siblings, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2018-03-22  9:22 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben


On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> Force preemption uses engine reset to enforce a limit on the time
> that a request targeted for preemption can block. This feature is
> a requirement in automotive systems where the GPU may be shared by
> clients of critically high priority and clients of low priority that
> may not have been curated to be preemption friendly. There may be
> more general applications of this feature. I'm sharing as an RFC to
> stimulate that discussion and also to get any technical feedback
> that I can before submitting to the product kernel that needs this.
> I have developed the patches for ease of rebase, given that this is
> for the moment considered a non-upstreamable feature. It would be
> possible to refactor hangcheck to fully incorporate force preemption
> as another tier of patience (or impatience) with the running request.

Sorry if it was mentioned elsewhere and I missed it - but does this work 
only with stateless clients - or in other words, what would happen to 
stateful clients which would be force preempted? Or the answer is we 
don't care since they are misbehaving?

Maybe banning such contexts on the first force preemption would make sense 
for this specific target environment?

Regards,

Tvrtko

> 
> Chris Wilson (5):
>    drm/i915/execlists: Refactor out complete_preempt_context()
>    drm/i915: Add control flags to i915_handle_error()
>    drm/i915: Move engine reset prepare/finish to backends
>    drm/i915: Split execlists/guc reset prepartions
>    drm/i915/execlists: Flush pending preemption events during reset
> 
> Jeff McGee (3):
>    drm/i915: Fix loop on CSB processing
>    drm/i915: Skip CSB processing on invalid CSB tail
>    drm/i915: Force preemption to complete via engine reset
> 
>   drivers/gpu/drm/i915/i915_debugfs.c              |   4 +-
>   drivers/gpu/drm/i915/i915_drv.c                  |  17 +-
>   drivers/gpu/drm/i915/i915_drv.h                  |  10 +-
>   drivers/gpu/drm/i915/i915_gem.c                  |  69 ++--
>   drivers/gpu/drm/i915/i915_gpu_error.h            |   3 +
>   drivers/gpu/drm/i915/i915_irq.c                  |  55 +--
>   drivers/gpu/drm/i915/i915_params.c               |   3 +
>   drivers/gpu/drm/i915/i915_params.h               |   1 +
>   drivers/gpu/drm/i915/i915_request.c              |   2 +-
>   drivers/gpu/drm/i915/intel_engine_cs.c           |  40 +++
>   drivers/gpu/drm/i915/intel_guc_submission.c      |  39 ++
>   drivers/gpu/drm/i915/intel_hangcheck.c           |   8 +-
>   drivers/gpu/drm/i915/intel_lrc.c                 | 436 ++++++++++++++---------
>   drivers/gpu/drm/i915/intel_ringbuffer.c          |  20 +-
>   drivers/gpu/drm/i915/intel_ringbuffer.h          |  13 +-
>   drivers/gpu/drm/i915/selftests/intel_hangcheck.c |  13 +-
>   16 files changed, 469 insertions(+), 264 deletions(-)
> 

* Re: [RFC 0/8] Force preemption
  2018-03-22  9:22 ` [RFC 0/8] Force preemption Tvrtko Ursulin
@ 2018-03-22  9:28   ` Chris Wilson
  2018-03-22 14:34     ` Jeff McGee
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2018-03-22  9:28 UTC (permalink / raw)
  To: Tvrtko Ursulin, jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> 
> On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> > From: Jeff McGee <jeff.mcgee@intel.com>
> > 
> > Force preemption uses engine reset to enforce a limit on the time
> > that a request targeted for preemption can block. This feature is
> > a requirement in automotive systems where the GPU may be shared by
> > clients of critically high priority and clients of low priority that
> > may not have been curated to be preemption friendly. There may be
> > more general applications of this feature. I'm sharing as an RFC to
> > stimulate that discussion and also to get any technical feedback
> > that I can before submitting to the product kernel that needs this.
> > I have developed the patches for ease of rebase, given that this is
> > for the moment considered a non-upstreamable feature. It would be
> > possible to refactor hangcheck to fully incorporate force preemption
> > as another tier of patience (or impatience) with the running request.
> 
> Sorry if it was mentioned elsewhere and I missed it - but does this work 
> only with stateless clients - or in other words, what would happen to 
> stateful clients which would be force preempted? Or the answer is we 
> don't care since they are misbehaving?

They get notified of being guilty for causing a gpu reset; three strikes
and they are out (banned from using the gpu) using the current rules.
This is a very blunt hammer that requires the rest of the system to be
robust; one might argue time spent making the system robust would be
better served making sure that the timer never expired in the first place
thereby eliminating the need for a forced gpu reset.
-Chris
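The "three strikes" policy Chris refers to amounts to a per-context ban
score. The sketch below uses hypothetical names and takes the threshold
from the discussion, not from the i915 sources: each reset that blames a
context raises its score, and past the threshold it is banned from the GPU.

```c
#include <assert.h>
#include <stdbool.h>

#define CONTEXT_SCORE_BAN_THRESHOLD 3	/* "three strikes" */

struct ctx {
	unsigned int ban_score;
	bool banned;
};

/* Called when a reset blames this context (hang or forced preemption). */
static void context_mark_guilty(struct ctx *c)
{
	if (++c->ban_score >= CONTEXT_SCORE_BAN_THRESHOLD)
		c->banned = true;
}
```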

* Re: [RFC 0/8] Force preemption
  2018-03-22  9:28   ` Chris Wilson
@ 2018-03-22 14:34     ` Jeff McGee
  2018-03-22 15:35       ` Chris Wilson
  2018-03-22 15:57       ` Tvrtko Ursulin
  0 siblings, 2 replies; 29+ messages in thread
From: Jeff McGee @ 2018-03-22 14:34 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, ben, kalyan.kondapally

On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> > 
> > On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> > > From: Jeff McGee <jeff.mcgee@intel.com>
> > > 
> > > Force preemption uses engine reset to enforce a limit on the time
> > > that a request targeted for preemption can block. This feature is
> > > a requirement in automotive systems where the GPU may be shared by
> > > clients of critically high priority and clients of low priority that
> > > may not have been curated to be preemption friendly. There may be
> > > more general applications of this feature. I'm sharing as an RFC to
> > > stimulate that discussion and also to get any technical feedback
> > > that I can before submitting to the product kernel that needs this.
> > > I have developed the patches for ease of rebase, given that this is
> > > for the moment considered a non-upstreamable feature. It would be
> > > possible to refactor hangcheck to fully incorporate force preemption
> > > as another tier of patience (or impatience) with the running request.
> > 
> > Sorry if it was mentioned elsewhere and I missed it - but does this work 
> > only with stateless clients - or in other words, what would happen to 
> > stateful clients which would be force preempted? Or the answer is we 
> > don't care since they are misbehaving?
> 
> They get notified of being guilty for causing a gpu reset; three strikes
> and they are out (banned from using the gpu) using the current rules.
> This is a very blunt hammer that requires the rest of the system to be
> robust; one might argue time spent making the system robust would be
> better served making sure that the timer never expired in the first place
> thereby eliminating the need for a forced gpu reset.
> -Chris

Yes, for simplification the policy applied to force-preempted contexts
is the same as for hanging contexts. It is known that this feature
should not be required in a fully curated system. It's a requirement
if the end user is allowed to install 3rd-party apps to run in the
non-critical domain.
-Jeff

* Re: [RFC 0/8] Force preemption
  2018-03-22 14:34     ` Jeff McGee
@ 2018-03-22 15:35       ` Chris Wilson
  2018-03-22 15:44         ` Jeff McGee
  2018-03-22 15:57       ` Tvrtko Ursulin
  1 sibling, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2018-03-22 15:35 UTC (permalink / raw)
  To: Jeff McGee; +Cc: intel-gfx, ben, kalyan.kondapally

Quoting Jeff McGee (2018-03-22 14:34:58)
> On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> > > 
> > > On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> > > > From: Jeff McGee <jeff.mcgee@intel.com>
> > > > 
> > > > Force preemption uses engine reset to enforce a limit on the time
> > > > that a request targeted for preemption can block. This feature is
> > > > a requirement in automotive systems where the GPU may be shared by
> > > > clients of critically high priority and clients of low priority that
> > > > may not have been curated to be preemption friendly. There may be
> > > > more general applications of this feature. I'm sharing as an RFC to
> > > > stimulate that discussion and also to get any technical feedback
> > > > that I can before submitting to the product kernel that needs this.
> > > > I have developed the patches for ease of rebase, given that this is
> > > > for the moment considered a non-upstreamable feature. It would be
> > > > possible to refactor hangcheck to fully incorporate force preemption
> > > > as another tier of patience (or impatience) with the running request.
> > > 
> > > Sorry if it was mentioned elsewhere and I missed it - but does this work 
> > > only with stateless clients - or in other words, what would happen to 
> > > stateful clients which would be force preempted? Or the answer is we 
> > > don't care since they are misbehaving?
> > 
> > They get notified of being guilty for causing a gpu reset; three strikes
> > and they are out (banned from using the gpu) using the current rules.
> > This is a very blunt hammer that requires the rest of the system to be
> > robust; one might argue time spent making the system robust would be
> > better served making sure that the timer never expired in the first place
> > thereby eliminating the need for a forced gpu reset.
> > -Chris
> 
> Yes, for simplication the policy applied to force preempted contexts
> is the same as for hanging contexts. It is known that this feature
> should not be required in a fully curated system. It's a requirement
> if end user will be alllowed to install 3rd party apps to run in the
> non-critical domain.

Third-party code is still mediated by our userspace drivers, or are you
contemplating scenarios where they talk directly to the ioctls? How hostile
a client do we have to contend with, i.e. must we survive a gpu fork bomb?
-Chris

* Re: [RFC 0/8] Force preemption
  2018-03-22 15:35       ` Chris Wilson
@ 2018-03-22 15:44         ` Jeff McGee
  0 siblings, 0 replies; 29+ messages in thread
From: Jeff McGee @ 2018-03-22 15:44 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, ben, kalyan.kondapally

On Thu, Mar 22, 2018 at 03:35:19PM +0000, Chris Wilson wrote:
> Quoting Jeff McGee (2018-03-22 14:34:58)
> > On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> > > Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> > > > 
> > > > On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> > > > > From: Jeff McGee <jeff.mcgee@intel.com>
> > > > > 
> > > > > Force preemption uses engine reset to enforce a limit on the time
> > > > > that a request targeted for preemption can block. This feature is
> > > > > a requirement in automotive systems where the GPU may be shared by
> > > > > clients of critically high priority and clients of low priority that
> > > > > may not have been curated to be preemption friendly. There may be
> > > > > more general applications of this feature. I'm sharing as an RFC to
> > > > > stimulate that discussion and also to get any technical feedback
> > > > > that I can before submitting to the product kernel that needs this.
> > > > > I have developed the patches for ease of rebase, given that this is
> > > > > for the moment considered a non-upstreamable feature. It would be
> > > > > possible to refactor hangcheck to fully incorporate force preemption
> > > > > as another tier of patience (or impatience) with the running request.
> > > > 
> > > > Sorry if it was mentioned elsewhere and I missed it - but does this work 
> > > > only with stateless clients - or in other words, what would happen to 
> > > > stateful clients which would be force preempted? Or the answer is we 
> > > > don't care since they are misbehaving?
> > > 
> > > They get notified of being guilty for causing a gpu reset; three strikes
> > > and they are out (banned from using the gpu) using the current rules.
> > > This is a very blunt hammer that requires the rest of the system to be
> > > robust; one might argue time spent making the system robust would be
> > > better served making sure that the timer never expired in the first place
> > > thereby eliminating the need for a forced gpu reset.
> > > -Chris
> > 
> > Yes, for simplication the policy applied to force preempted contexts
> > is the same as for hanging contexts. It is known that this feature
> > should not be required in a fully curated system. It's a requirement
> > if end user will be alllowed to install 3rd party apps to run in the
> > non-critical domain.
> 
> Third party code is still mediated by our userspace drivers, or are you
> contemplating scenarios where they talk directly to ioctls? How hostile
> do we have to contend with, i.e. survive a gpu fork bomb?
> -Chris

The user space driver has very little influence over the preemptibility of
the user space app. The most it can do is enable the finest granularity,
e.g. 3D object level. But the app can still submit a single giant triangle
with an uber pixel shader. Nothing overtly wrong with that, but it is not
mid-batch preemptible on our hardware.
-Jeff
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-22 14:34     ` Jeff McGee
  2018-03-22 15:35       ` Chris Wilson
@ 2018-03-22 15:57       ` Tvrtko Ursulin
  2018-03-22 16:01         ` Jeff McGee
  1 sibling, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2018-03-22 15:57 UTC (permalink / raw)
  To: Jeff McGee, Chris Wilson; +Cc: kalyan.kondapally, intel-gfx, ben


On 22/03/2018 14:34, Jeff McGee wrote:
> On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
>> Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
>>>
>>> On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
>>>> From: Jeff McGee <jeff.mcgee@intel.com>
>>>>
>>>> Force preemption uses engine reset to enforce a limit on the time
>>>> that a request targeted for preemption can block. This feature is
>>>> a requirement in automotive systems where the GPU may be shared by
>>>> clients of critically high priority and clients of low priority that
>>>> may not have been curated to be preemption friendly. There may be
>>>> more general applications of this feature. I'm sharing as an RFC to
>>>> stimulate that discussion and also to get any technical feedback
>>>> that I can before submitting to the product kernel that needs this.
>>>> I have developed the patches for ease of rebase, given that this is
>>>> for the moment considered a non-upstreamable feature. It would be
>>>> possible to refactor hangcheck to fully incorporate force preemption
>>>> as another tier of patience (or impatience) with the running request.
>>>
>>> Sorry if it was mentioned elsewhere and I missed it - but does this work
>>> only with stateless clients - or in other words, what would happen to
>>> stateful clients which would be force preempted? Or the answer is we
>>> don't care since they are misbehaving?
>>
>> They get notified of being guilty for causing a gpu reset; three strikes
>> and they are out (banned from using the gpu) using the current rules.
>> This is a very blunt hammer that requires the rest of the system to be
>> robust; one might argue time spent making the system robust would be
>> better served making sure that the timer never expired in the first place
>> thereby eliminating the need for a forced gpu reset.
>> -Chris
> 
> Yes, for simplication the policy applied to force preempted contexts
> is the same as for hanging contexts. It is known that this feature
> should not be required in a fully curated system. It's a requirement
> if end user will be alllowed to install 3rd party apps to run in the
> non-critical domain.

My concern is whether it is safe to call this force _preemption_, while it 
is not really expected to work as preemption from the point of view of 
the preempted context. I may be missing some angle here, but I think a 
better name would include words like maximum request duration or something.

I can see a difference between the allowed maximum duration when there is 
something else pending and when there isn't, but I don't immediately see 
any real benefit in making that distinction.

So should the feature just be "maximum request duration"? This would 
perhaps make it just a special case of hangcheck, one which ignores head 
progress, or whatever we do in there.

Regards,

Tvrtko

* Re: [RFC 0/8] Force preemption
  2018-03-22 15:57       ` Tvrtko Ursulin
@ 2018-03-22 16:01         ` Jeff McGee
  2018-03-22 17:41           ` Tvrtko Ursulin
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff McGee @ 2018-03-22 16:01 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: kalyan.kondapally, intel-gfx, ben

On Thu, Mar 22, 2018 at 03:57:49PM +0000, Tvrtko Ursulin wrote:
> 
> On 22/03/2018 14:34, Jeff McGee wrote:
> >On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> >>Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> >>>
> >>>On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> >>>>From: Jeff McGee <jeff.mcgee@intel.com>
> >>>>
> >>>>Force preemption uses engine reset to enforce a limit on the time
> >>>>that a request targeted for preemption can block. This feature is
> >>>>a requirement in automotive systems where the GPU may be shared by
> >>>>clients of critically high priority and clients of low priority that
> >>>>may not have been curated to be preemption friendly. There may be
> >>>>more general applications of this feature. I'm sharing as an RFC to
> >>>>stimulate that discussion and also to get any technical feedback
> >>>>that I can before submitting to the product kernel that needs this.
> >>>>I have developed the patches for ease of rebase, given that this is
> >>>>for the moment considered a non-upstreamable feature. It would be
> >>>>possible to refactor hangcheck to fully incorporate force preemption
> >>>>as another tier of patience (or impatience) with the running request.
> >>>
> >>>Sorry if it was mentioned elsewhere and I missed it - but does this work
> >>>only with stateless clients - or in other words, what would happen to
> >>>stateful clients which would be force preempted? Or the answer is we
> >>>don't care since they are misbehaving?
> >>
> >>They get notified of being guilty for causing a gpu reset; three strikes
> >>and they are out (banned from using the gpu) using the current rules.
> >>This is a very blunt hammer that requires the rest of the system to be
> >>robust; one might argue time spent making the system robust would be
> >>better served making sure that the timer never expired in the first place
> >>thereby eliminating the need for a forced gpu reset.
> >>-Chris
> >
> >Yes, for simplication the policy applied to force preempted contexts
> >is the same as for hanging contexts. It is known that this feature
> >should not be required in a fully curated system. It's a requirement
> >if end user will be alllowed to install 3rd party apps to run in the
> >non-critical domain.
> 
> My concern is whether it safe to call this force _preemption_, while
> it is not really expected to work as preemption from the point of
> view of preempted context. I may be missing some angle here, but I
> think a better name would include words like maximum request
> duration or something.
> 
> I can see a difference between allowed maximum duration when there
> is something else pending, and when it isn't, but I don't
> immediately see that we should consider this distinction for any
> real benefit?
> 
> So should the feature just be "maximum request duration"? This would
> perhaps make it just a special case of hangcheck, which ignores head
> progress, or whatever we do in there.
> 
> Regards,
> 
> Tvrtko

I think you might be unclear about how this works. We're not starting a
preemption to see if we can cleanly remove a request that has begun to
exceed its normal time slice, i.e. hangcheck. This is about bounding
the time that a normal preemption can take: first start preemption
in response to the arrival of a higher-priority request, then wait for
the preemption to complete within a certain amount of time. If it does
not, resort to reset.

So it's really "force the resolution of a preemption", shortened to
"force preemption".
-Jeff

* Re: [RFC 0/8] Force preemption
  2018-03-22 16:01         ` Jeff McGee
@ 2018-03-22 17:41           ` Tvrtko Ursulin
  2018-03-22 19:08             ` Jeff McGee
  0 siblings, 1 reply; 29+ messages in thread
From: Tvrtko Ursulin @ 2018-03-22 17:41 UTC (permalink / raw)
  To: Jeff McGee; +Cc: kalyan.kondapally, intel-gfx, ben


On 22/03/2018 16:01, Jeff McGee wrote:
> On Thu, Mar 22, 2018 at 03:57:49PM +0000, Tvrtko Ursulin wrote:
>>
>> On 22/03/2018 14:34, Jeff McGee wrote:
>>> On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
>>>> Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
>>>>>
>>>>> On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
>>>>>> From: Jeff McGee <jeff.mcgee@intel.com>
>>>>>>
>>>>>> Force preemption uses engine reset to enforce a limit on the time
>>>>>> that a request targeted for preemption can block. This feature is
>>>>>> a requirement in automotive systems where the GPU may be shared by
>>>>>> clients of critically high priority and clients of low priority that
>>>>>> may not have been curated to be preemption friendly. There may be
>>>>>> more general applications of this feature. I'm sharing as an RFC to
>>>>>> stimulate that discussion and also to get any technical feedback
>>>>>> that I can before submitting to the product kernel that needs this.
>>>>>> I have developed the patches for ease of rebase, given that this is
>>>>>> for the moment considered a non-upstreamable feature. It would be
>>>>>> possible to refactor hangcheck to fully incorporate force preemption
>>>>>> as another tier of patience (or impatience) with the running request.
>>>>>
>>>>> Sorry if it was mentioned elsewhere and I missed it - but does this work
>>>>> only with stateless clients - or in other words, what would happen to
>>>>> stateful clients which would be force preempted? Or the answer is we
>>>>> don't care since they are misbehaving?
>>>>
>>>> They get notified of being guilty for causing a gpu reset; three strikes
>>>> and they are out (banned from using the gpu) using the current rules.
>>>> This is a very blunt hammer that requires the rest of the system to be
>>>> robust; one might argue time spent making the system robust would be
>>>> better served making sure that the timer never expired in the first place
>>>> thereby eliminating the need for a forced gpu reset.
>>>> -Chris
>>>
>>> Yes, for simplication the policy applied to force preempted contexts
>>> is the same as for hanging contexts. It is known that this feature
>>> should not be required in a fully curated system. It's a requirement
>>> if end user will be alllowed to install 3rd party apps to run in the
>>> non-critical domain.
>>
>> My concern is whether it safe to call this force _preemption_, while
>> it is not really expected to work as preemption from the point of
>> view of preempted context. I may be missing some angle here, but I
>> think a better name would include words like maximum request
>> duration or something.
>>
>> I can see a difference between allowed maximum duration when there
>> is something else pending, and when it isn't, but I don't
>> immediately see that we should consider this distinction for any
>> real benefit?
>>
>> So should the feature just be "maximum request duration"? This would
>> perhaps make it just a special case of hangcheck, which ignores head
>> progress, or whatever we do in there.
>>
>> Regards,
>>
>> Tvrtko
> 
> I think you might be unclear about how this works. We're not starting a
> preemption to see if we can cleanly remove a request who has begun to
> exceed its normal time slice, i.e. hangcheck. This is about bounding
> the time that a normal preemption can take. So first start preemption
> in response to higher-priority request arrival, then wait for preemption
> to complete within a certain amount of time. If it does not, resort to
> reset.
> 
> So it's really "force the resolution of a preemption", shortened to
> "force preemption".

You are right, I veered off in my thinking and ended up with something 
different. :)

I however still think the name is potentially misleading, since the 
request/context is not getting preempted. It is getting effectively 
killed (sooner or later, directly or indirectly).

Maybe that is OK for the specific use case, where everything is only 
broken and not malicious.

In a more general purpose system it would be a bit random when something 
would work and when it wouldn't, depending on system setup and even 
timings.

Hm, maybe you don't even really benefit from the standard three-strikes- 
and-you-are-out policy; for this specific use case, you could just 
kill it straight away. If it couldn't be preempted once, why pay the 
penalty any more?

If you don't have it already, devising a solution which blacklists the 
process (if it creates more contexts), or even a parent (if forking is 
applicable and implementation feasible), for offenders could also be 
beneficial.

Regards,

Tvrtko

* Re: [RFC 0/8] Force preemption
  2018-03-22 17:41           ` Tvrtko Ursulin
@ 2018-03-22 19:08             ` Jeff McGee
  2018-03-22 19:59               ` Bloomfield, Jon
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff McGee @ 2018-03-22 19:08 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: kalyan.kondapally, intel-gfx, ben

On Thu, Mar 22, 2018 at 05:41:57PM +0000, Tvrtko Ursulin wrote:
> 
> On 22/03/2018 16:01, Jeff McGee wrote:
> >On Thu, Mar 22, 2018 at 03:57:49PM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 22/03/2018 14:34, Jeff McGee wrote:
> >>>On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> >>>>Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> >>>>>
> >>>>>On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> >>>>>>From: Jeff McGee <jeff.mcgee@intel.com>
> >>>>>>
> >>>>>>Force preemption uses engine reset to enforce a limit on the time
> >>>>>>that a request targeted for preemption can block. This feature is
> >>>>>>a requirement in automotive systems where the GPU may be shared by
> >>>>>>clients of critically high priority and clients of low priority that
> >>>>>>may not have been curated to be preemption friendly. There may be
> >>>>>>more general applications of this feature. I'm sharing as an RFC to
> >>>>>>stimulate that discussion and also to get any technical feedback
> >>>>>>that I can before submitting to the product kernel that needs this.
> >>>>>>I have developed the patches for ease of rebase, given that this is
> >>>>>>for the moment considered a non-upstreamable feature. It would be
> >>>>>>possible to refactor hangcheck to fully incorporate force preemption
> >>>>>>as another tier of patience (or impatience) with the running request.
> >>>>>
> >>>>>Sorry if it was mentioned elsewhere and I missed it - but does this work
> >>>>>only with stateless clients - or in other words, what would happen to
> >>>>>stateful clients which would be force preempted? Or the answer is we
> >>>>>don't care since they are misbehaving?
> >>>>
> >>>>They get notified of being guilty for causing a gpu reset; three strikes
> >>>>and they are out (banned from using the gpu) using the current rules.
> >>>>This is a very blunt hammer that requires the rest of the system to be
> >>>>robust; one might argue time spent making the system robust would be
> >>>>better served making sure that the timer never expired in the first place
> >>>>thereby eliminating the need for a forced gpu reset.
> >>>>-Chris
> >>>
> >>>Yes, for simplication the policy applied to force preempted contexts
> >>>is the same as for hanging contexts. It is known that this feature
> >>>should not be required in a fully curated system. It's a requirement
> >>>if end user will be alllowed to install 3rd party apps to run in the
> >>>non-critical domain.
> >>
> >>My concern is whether it safe to call this force _preemption_, while
> >>it is not really expected to work as preemption from the point of
> >>view of preempted context. I may be missing some angle here, but I
> >>think a better name would include words like maximum request
> >>duration or something.
> >>
> >>I can see a difference between allowed maximum duration when there
> >>is something else pending, and when it isn't, but I don't
> >>immediately see that we should consider this distinction for any
> >>real benefit?
> >>
> >>So should the feature just be "maximum request duration"? This would
> >>perhaps make it just a special case of hangcheck, which ignores head
> >>progress, or whatever we do in there.
> >>
> >>Regards,
> >>
> >>Tvrtko
> >
> >I think you might be unclear about how this works. We're not starting a
> >preemption to see if we can cleanly remove a request who has begun to
> >exceed its normal time slice, i.e. hangcheck. This is about bounding
> >the time that a normal preemption can take. So first start preemption
> >in response to higher-priority request arrival, then wait for preemption
> >to complete within a certain amount of time. If it does not, resort to
> >reset.
> >
> >So it's really "force the resolution of a preemption", shortened to
> >"force preemption".
> 
> You are right, I veered off in my thinking and ended up with
> something different. :)
> 
> I however still think the name is potentially misleading, since the
> request/context is not getting preempted. It is getting effectively
> killed (sooner or later, directly or indirectly).
> 
> Maybe that is OK for the specific use case when everything is only
> broken and not malicious.
> 
> In a more general purpose system it would be a bit random when
> something would work, and when it wouldn't, depending on system
> setup and even timings.
> 
> Hm, maybe you don't even really benefit from the standard three
> strikes and you are out policy, and for this specific use case, you
> should just kill it straight away. If it couldn't be preempted once,
> why pay the penalty any more?
> 
> If you don't have it already, devising a solution which blacklists
> the process (if it creates more contexts), or even a parent (if
> forking is applicable and implementation feasible), for offenders
> could also be beneficial.
> 
> Regards,
> 
> Tvrtko

Fair enough. There wasn't a lot of deliberation on this name. We
referred to it in various ways during development. I think I started
using "force preemption" because it was short. "Reset to preempt" was
another phrase that was used.

The handling of the guilty client/context could be tailored further. Like
I said, it was easiest to start with the same sort of handling that we
already have for hang scenarios. Simple is good when you are rebasing
without much hope of upstreaming. :(

If there were interest in upstreaming this capability, we could certainly
incorporate it nicely within a refactoring of hangcheck, and then we
wouldn't even need a special name for it. The whole thing could be recast
as time slice management, where your slice is condition-based: you get
unlimited time if no one wants the engine, X time if someone of equal
priority wants the engine, Y time if someone of higher priority wants
the engine, etc., where 'Y' is analogous to the fpreempt_timeout
value in my RFC.

Jeff

* Re: [RFC 0/8] Force preemption
  2018-03-22 19:08             ` Jeff McGee
@ 2018-03-22 19:59               ` Bloomfield, Jon
  2018-03-23 13:20                 ` Joonas Lahtinen
  0 siblings, 1 reply; 29+ messages in thread
From: Bloomfield, Jon @ 2018-03-22 19:59 UTC (permalink / raw)
  To: Mcgee, Jeff, Tvrtko Ursulin; +Cc: Kondapally, Kalyan, intel-gfx, ben

> From: Intel-gfx [mailto:intel-gfx-bounces@lists.freedesktop.org] On Behalf
> Of Jeff McGee
> Sent: Thursday, March 22, 2018 12:09 PM
> To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Kondapally, Kalyan <kalyan.kondapally@intel.com>; intel-
> gfx@lists.freedesktop.org; ben@bwidawsk.net
> Subject: Re: [Intel-gfx] [RFC 0/8] Force preemption
> 
> On Thu, Mar 22, 2018 at 05:41:57PM +0000, Tvrtko Ursulin wrote:
> >
> > On 22/03/2018 16:01, Jeff McGee wrote:
> > >On Thu, Mar 22, 2018 at 03:57:49PM +0000, Tvrtko Ursulin wrote:
> > >>
> > >>On 22/03/2018 14:34, Jeff McGee wrote:
> > >>>On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> > >>>>Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> > >>>>>
> > >>>>>On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> > >>>>>>From: Jeff McGee <jeff.mcgee@intel.com>
> > >>>>>>
> > >>>>>>Force preemption uses engine reset to enforce a limit on the time
> > >>>>>>that a request targeted for preemption can block. This feature is
> > >>>>>>a requirement in automotive systems where the GPU may be
> shared by
> > >>>>>>clients of critically high priority and clients of low priority that
> > >>>>>>may not have been curated to be preemption friendly. There may
> be
> > >>>>>>more general applications of this feature. I'm sharing as an RFC to
> > >>>>>>stimulate that discussion and also to get any technical feedback
> > >>>>>>that I can before submitting to the product kernel that needs this.
> > >>>>>>I have developed the patches for ease of rebase, given that this is
> > >>>>>>for the moment considered a non-upstreamable feature. It would
> be
> > >>>>>>possible to refactor hangcheck to fully incorporate force
> preemption
> > >>>>>>as another tier of patience (or impatience) with the running
> request.
> > >>>>>
> > >>>>>Sorry if it was mentioned elsewhere and I missed it - but does this
> work
> > >>>>>only with stateless clients - or in other words, what would happen to
> > >>>>>stateful clients which would be force preempted? Or the answer is
> we
> > >>>>>don't care since they are misbehaving?
> > >>>>
> > >>>>They get notified of being guilty for causing a gpu reset; three strikes
> > >>>>and they are out (banned from using the gpu) using the current rules.
> > >>>>This is a very blunt hammer that requires the rest of the system to be
> > >>>>robust; one might argue time spent making the system robust would
> be
> > >>>>better served making sure that the timer never expired in the first
> place
> > >>>>thereby eliminating the need for a forced gpu reset.
> > >>>>-Chris
> > >>>
> > >>>Yes, for simplication the policy applied to force preempted contexts
> > >>>is the same as for hanging contexts. It is known that this feature
> > >>>should not be required in a fully curated system. It's a requirement
> > >>>if end user will be alllowed to install 3rd party apps to run in the
> > >>>non-critical domain.
> > >>
> > >>My concern is whether it safe to call this force _preemption_, while
> > >>it is not really expected to work as preemption from the point of
> > >>view of preempted context. I may be missing some angle here, but I
> > >>think a better name would include words like maximum request
> > >>duration or something.
> > >>
> > >>I can see a difference between allowed maximum duration when there
> > >>is something else pending, and when it isn't, but I don't
> > >>immediately see that we should consider this distinction for any
> > >>real benefit?
> > >>
> > >>So should the feature just be "maximum request duration"? This would
> > >>perhaps make it just a special case of hangcheck, which ignores head
> > >>progress, or whatever we do in there.
> > >>
> > >>Regards,
> > >>
> > >>Tvrtko
> > >
> > >I think you might be unclear about how this works. We're not starting a
> > >preemption to see if we can cleanly remove a request who has begun to
> > >exceed its normal time slice, i.e. hangcheck. This is about bounding
> > >the time that a normal preemption can take. So first start preemption
> > >in response to higher-priority request arrival, then wait for preemption
> > >to complete within a certain amount of time. If it does not, resort to
> > >reset.
> > >
> > >So it's really "force the resolution of a preemption", shortened to
> > >"force preemption".
> >
> > You are right, I veered off in my thinking and ended up with
> > something different. :)
> >
> > I however still think the name is potentially misleading, since the
> > request/context is not getting preempted. It is getting effectively
> > killed (sooner or later, directly or indirectly).
> >
> > Maybe that is OK for the specific use case when everything is only
> > broken and not malicious.
> >
> > In a more general purpose system it would be a bit random when
> > something would work, and when it wouldn't, depending on system
> > setup and even timings.
> >
> > Hm, maybe you don't even really benefit from the standard three
> > strikes and you are out policy, and for this specific use case, you
> > should just kill it straight away. If it couldn't be preempted once,
> > why pay the penalty any more?
> >
> > If you don't have it already, devising a solution which blacklists
> > the process (if it creates more contexts), or even a parent (if
> > forking is applicable and implementation feasible), for offenders
> > could also be beneficial.
> >
> > Regards,
> >
> > Tvrtko
> 
> Fair enough. There wasn't a lot of deliberation on this name. We
> referred to it in various ways during development. I think I started
> using "force preemption" because it was short. "reset to preempt" was
> another phrase that was used.
> 
> The handling of the guilty client/context could be tailored more. Like
> I said it was easiest to start with the same sort of handling that we
> have already for hang scenarios. Simple is good when you are rebasing
> without much hope to upstream. :(
> 
> If there was interest in upstreaming this capability, we could certainly
> incorporate nicely within a refactoring of hangcheck. And then we
> wouldn't even need a special name for it. The whole thing could be recast
> as time slice management, where your slice is condition-based. You get
> unlimited time if no one wants the engine, X time if someone of equal
> priority wants the engine, and Y time if someone of higher priority
> wants the engine, etc. Where 'Y' is analogous to the fpreempt_timeout
> value in my RFC.
> 
On the subject of it being "a bit random when something would work":
currently it's the well-behaved app (and everything else behind it)
that pays the price for a badly-behaved UMD being a bad citizen. 12s is
a really long time for the GUI to freeze before a bad context can be nuked.

Yes, with an enforced preemption latency the bad UMD pays a price: if
it's lucky, it will run to completion; if unlucky, it gets nuked. But it's
a bad UMD at the end of the day, and we should find it and fix it.

With fair scheduling, this becomes even more important - all UMDs need
to ensure that they are capable of preempting at the finest granularity
supported by the h/w for their respective workloads. If they don't,
they really should be obliterated (aka detected and fixed).

The big benefit I see with the new approach (vs TDR) is that there's actually
no need to enforce the (somewhat arbitrarily determined 3x4s) timeout for a
context at all, provided it plays nicely. If you want to run a really long
GPGPU workload, do so by all means; just don't hog the GPU.

BTW, the same context banning should be applied to a context nuked
by forced preemption as with TDR. That's the intention at least. This is just
an RFC right now.

* Re: [RFC 0/8] Force preemption
  2018-03-22 19:59               ` Bloomfield, Jon
@ 2018-03-23 13:20                 ` Joonas Lahtinen
  2018-03-23 13:37                   ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Joonas Lahtinen @ 2018-03-23 13:20 UTC (permalink / raw)
  To: Bloomfield, Jon, Mcgee, Jeff, Tvrtko Ursulin
  Cc: Kondapally, Kalyan, intel-gfx, ben

Quoting Bloomfield, Jon (2018-03-22 21:59:33)
> > From: Intel-gfx [mailto:intel-gfx-bounces@lists.freedesktop.org] On Behalf
> > Of Jeff McGee
> > Sent: Thursday, March 22, 2018 12:09 PM
> > To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > Cc: Kondapally, Kalyan <kalyan.kondapally@intel.com>; intel-
> > gfx@lists.freedesktop.org; ben@bwidawsk.net
> > Subject: Re: [Intel-gfx] [RFC 0/8] Force preemption
> > 
> > On Thu, Mar 22, 2018 at 05:41:57PM +0000, Tvrtko Ursulin wrote:
> > >
> > > On 22/03/2018 16:01, Jeff McGee wrote:
> > > >On Thu, Mar 22, 2018 at 03:57:49PM +0000, Tvrtko Ursulin wrote:
> > > >>
> > > >>On 22/03/2018 14:34, Jeff McGee wrote:
> > > >>>On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> > > >>>>Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> > > >>>>>
> > > >>>>>On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> > > >>>>>>From: Jeff McGee <jeff.mcgee@intel.com>
> > > >>>>>>
> > > >>>>>>Force preemption uses engine reset to enforce a limit on the time
> > > >>>>>>that a request targeted for preemption can block. This feature is
> > > >>>>>>a requirement in automotive systems where the GPU may be
> > shared by
> > > >>>>>>clients of critically high priority and clients of low priority that
> > > >>>>>>may not have been curated to be preemption friendly. There may
> > be
> > > >>>>>>more general applications of this feature. I'm sharing as an RFC to
> > > >>>>>>stimulate that discussion and also to get any technical feedback
> > > >>>>>>that I can before submitting to the product kernel that needs this.
> > > >>>>>>I have developed the patches for ease of rebase, given that this is
> > > >>>>>>for the moment considered a non-upstreamable feature. It would
> > be
> > > >>>>>>possible to refactor hangcheck to fully incorporate force
> > preemption
> > > >>>>>>as another tier of patience (or impatience) with the running
> > request.
> > > >>>>>
> > > >>>>>Sorry if it was mentioned elsewhere and I missed it - but does this
> > work
> > > >>>>>only with stateless clients - or in other words, what would happen to
> > > >>>>>stateful clients which would be force preempted? Or the answer is
> > we
> > > >>>>>don't care since they are misbehaving?
> > > >>>>
> > > >>>>They get notified of being guilty for causing a gpu reset; three strikes
> > > >>>>and they are out (banned from using the gpu) using the current rules.
> > > >>>>This is a very blunt hammer that requires the rest of the system to be
> > > >>>>robust; one might argue time spent making the system robust would
> > be
> > > >>>>better served making sure that the timer never expired in the first
> > place
> > > >>>>thereby eliminating the need for a forced gpu reset.
> > > >>>>-Chris
> > > >>>
> > > >>>Yes, for simplication the policy applied to force preempted contexts
> > > >>>is the same as for hanging contexts. It is known that this feature
> > > >>>should not be required in a fully curated system. It's a requirement
> > > >>>if end user will be alllowed to install 3rd party apps to run in the
> > > >>>non-critical domain.
> > > >>
> > > >>My concern is whether it is safe to call this force _preemption_, while
> > > >>it is not really expected to work as preemption from the point of
> > > >>view of the preempted context. I may be missing some angle here, but
> > > >>I think a better name would include words like maximum request
> > > >>duration or something.
> > > >>
> > > >>I can see a difference between allowed maximum duration when there
> > > >>is something else pending, and when it isn't, but I don't
> > > >>immediately see that we should consider this distinction for any
> > > >>real benefit?
> > > >>
> > > >>So should the feature just be "maximum request duration"? This would
> > > >>perhaps make it just a special case of hangcheck, which ignores head
> > > >>progress, or whatever we do in there.
> > > >>
> > > >>Regards,
> > > >>
> > > >>Tvrtko
> > > >
> > > >I think you might be unclear about how this works. We're not starting a
> > > >preemption to see if we can cleanly remove a request that has begun to
> > > >exceed its normal time slice, i.e. hangcheck. This is about bounding
> > > >the time that a normal preemption can take. So first start preemption
> > > >in response to higher-priority request arrival, then wait for preemption
> > > >to complete within a certain amount of time. If it does not, resort to
> > > >reset.
> > > >
> > > >So it's really "force the resolution of a preemption", shortened to
> > > >"force preemption".
> > >
> > > You are right, I veered off in my thinking and ended up with
> > > something different. :)
> > >
> > > I however still think the name is potentially misleading, since the
> > > request/context is not getting preempted. It is getting effectively
> > > killed (sooner or later, directly or indirectly).
> > >
> > > Maybe that is OK for the specific use case when everything is only
> > > broken and not malicious.
> > >
> > > In a more general purpose system it would be a bit random when
> > > something would work, and when it wouldn't, depending on system
> > > setup and even timings.
> > >
> > > Hm, maybe you don't even really benefit from the standard three
> > > strikes and you are out policy, and for this specific use case, you
> > > should just kill it straight away. If it couldn't be preempted once,
> > > why pay the penalty any more?
> > >
> > > If you don't have it already, devising a solution which blacklists
> > > the process (if it creates more contexts), or even a parent (if
> > > forking is applicable and implementation feasible), for offenders
> > > could also be beneficial.
> > >
> > > Regards,
> > >
> > > Tvrtko
> > 
> > Fair enough. There wasn't a lot of deliberation on this name. We
> > referred to it in various ways during development. I think I started
> > using "force preemption" because it was short. "reset to preempt" was
> > another phrase that was used.
> > 
> > The handling of the guilty client/context could be tailored more. Like
> > I said it was easiest to start with the same sort of handling that we
> > have already for hang scenarios. Simple is good when you are rebasing
> > without much hope to upstream. :(
> > 
> > If there was interest in upstreaming this capability, we could certainly
> > incorporate nicely within a refactoring of hangcheck. And then we
> > wouldn't even need a special name for it. The whole thing could be recast
> > as time slice management, where your slice is condition-based. You get
> > unlimited time if no one wants the engine, X time if someone of equal
> > priority wants the engine, and Y time if someone of higher priority
> > wants the engine, etc. Where 'Y' is analogous to the fpreempt_timeout
> > value in my RFC.
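The condition-based slice selection described above can be sketched roughly as follows. The values and names are purely illustrative; Y plays the role of the fpreempt_timeout from the RFC:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SLICE_UNLIMITED UINT64_MAX	/* stand-in for "no limit" */

/* Pick the allowed running time based on who else wants the engine.
 * X and Y here are arbitrary example values. */
uint64_t timeslice_ns(bool equal_prio_waiting, bool higher_prio_waiting)
{
	const uint64_t x_equal_ns  = 20000000;	/* X: equal-priority waiter  */
	const uint64_t y_higher_ns =  1000000;	/* Y: higher-priority waiter */

	if (higher_prio_waiting)
		return y_higher_ns;
	if (equal_prio_waiting)
		return x_equal_ns;
	return SLICE_UNLIMITED;			/* nobody wants the engine */
}
```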
> > 
> On the subject of it being "a bit random when something would work":
> Currently it's the well behaved app (and everything else behind it)
> that pays the price for the badly-behaved umd being a bad citizen. 12s is
> a really long time for the GUI to freeze before a bad context can be nuked.

But the price is not that high when you get a momentary freeze of the
system, compared to applications dying on the user and unsaved work
getting lost, for example.

> Yes, with enforced pre-emption latency, the bad umd pays a price in that,
> if it's lucky, it will run to completion. If unlucky it gets nuked. But it's a bad
> umd at the end of the day. And we should find it and fix it. 

I think the key issue is the difference in difficulty between finding an
offending GPU hogger in a completely controlled product environment vs. in
the wild in desktop environment.

In the wild an application could've worked for years without problems
(being a GPU hogger, whatever the limit is set at), and stops working when
another application is introduced that demands to use the GPU at a higher
priority and the GPU hogging application won't slow down, but rather
will just die.

Requiring all userspace applications to suddenly have a short ARB check
period, or be in danger of getting killed, is not a light change to
make in the generic desktop environment.

So some clear opt-in in the form of "Sacrifice everything to keep running
at 60 FPS [ ]" tick-box from compositor would be required.

So far nobody has been successful in selling this to the userspace
compositors (the most likely user) - or has somebody?

Regards, Joonas

> With fair-scheduling, this becomes even more important - all umd's need
> to ensure that they are capable of pre-empting to the finest granularity
> supported by the h/w for their respective workloads. If they don't
> they really should be obliterated (aka detected and fixed).
> 
> The big benefit I see with the new approach (vs TDR) is that there's actually
> no need to enforce the (slightly arbitrarily determined 3x4s timeout) for a
> context at all, providing it plays nicely. If you want to run a really long
> GPGPU workload, do so by all means, just don't hog the GPU.
> 
> BTW, the same context banning should be applied to a context nuked
> by forced-preemption as with TDR. That's the intention at least. This is just
> an RFC right now.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* Re: [RFC 0/8] Force preemption
  2018-03-23 13:20                 ` Joonas Lahtinen
@ 2018-03-23 13:37                   ` Chris Wilson
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Wilson @ 2018-03-23 13:37 UTC (permalink / raw)
  To: Joonas Lahtinen, Bloomfield, Jon, Mcgee, Jeff, Tvrtko Ursulin
  Cc: Kondapally, Kalyan, intel-gfx, ben

Quoting Joonas Lahtinen (2018-03-23 13:20:40)
> So far nobody has been successful in selling this to the userspace
> compositors (the most likely user) - or has somebody?

I hadn't even contemplated selling it. However, it does seem applicable
to the RealTime priorities for Vk and 
https://www.khronos.org/registry/EGL/extensions/NV/EGL_NV_context_priority_realtime.txt

Gah. More fiddling with EGL...

Personally I don't think I'd want X/gnome-shell/weston to adopt a
RealTime or bust philosophy (i.e. I'm not planning to send a patch to
raise weston from PRIORITY_HIGH to PRIORITY_RT), but that doesn't mean
there aren't scenarios where it will be important.

So a context param, say PREEMPT_TIMEOUT in ns with 0 meaning disable and
bounded by CAP_SYS_NICE. Too soft? CAP_SYS_ADMIN, since it's about
giving permission to nuke others? Implementation side, having it on the
context is a fiddle, but we should be able to set a queue_timeout
at the same time as setting queue_priority to save having to find the
preempting context at irq time.
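Something like this, purely as a sketch of the permission check (hypothetical names, not actual i915 uAPI):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical validation for a PREEMPT_TIMEOUT context param.
 * set_preempt_timeout and the capability flag are illustrative
 * stand-ins, not real i915 code. */
int set_preempt_timeout(uint64_t *ctx_timeout_ns, uint64_t value_ns,
			bool has_cap_sys_nice)
{
	/* 0 disables forced preemption on this context's behalf; any
	 * non-zero bound grants permission to nuke others' work, so
	 * gate it on CAP_SYS_NICE (or CAP_SYS_ADMIN, per the question
	 * above). */
	if (value_ns && !has_cap_sys_nice)
		return -1;	/* would be -EPERM in the kernel */
	*ctx_timeout_ns = value_ns;
	return 0;
}
```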
-Chris


end of thread, other threads:[~2018-03-23 13:37 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-21 17:26 [RFC 0/8] Force preemption jeff.mcgee
2018-03-21 17:26 ` [RFC 1/8] drm/i915/execlists: Refactor out complete_preempt_context() jeff.mcgee
2018-03-21 17:26 ` [RFC 2/8] drm/i915: Add control flags to i915_handle_error() jeff.mcgee
2018-03-21 17:26 ` [RFC 3/8] drm/i915: Move engine reset prepare/finish to backends jeff.mcgee
2018-03-21 17:26 ` [RFC 4/8] drm/i915: Split execlists/guc reset prepartions jeff.mcgee
2018-03-21 17:26 ` [RFC 5/8] drm/i915/execlists: Flush pending preemption events during reset jeff.mcgee
2018-03-21 17:26 ` [RFC 6/8] drm/i915: Fix loop on CSB processing jeff.mcgee
2018-03-21 17:33   ` Jeff McGee
2018-03-21 18:06     ` Chris Wilson
2018-03-21 18:29       ` Jeff McGee
2018-03-21 19:04         ` Chris Wilson
2018-03-21 17:26 ` [RFC 7/8] drm/i915: Skip CSB processing on invalid CSB tail jeff.mcgee
2018-03-21 17:31   ` Jeff McGee
2018-03-21 18:12     ` Chris Wilson
2018-03-21 19:06       ` Chris Wilson
2018-03-21 17:26 ` [RFC 8/8] drm/i915: Force preemption to complete via engine reset jeff.mcgee
2018-03-21 18:50 ` ✗ Fi.CI.BAT: failure for Force preemption (rev2) Patchwork
2018-03-22  9:22 ` [RFC 0/8] Force preemption Tvrtko Ursulin
2018-03-22  9:28   ` Chris Wilson
2018-03-22 14:34     ` Jeff McGee
2018-03-22 15:35       ` Chris Wilson
2018-03-22 15:44         ` Jeff McGee
2018-03-22 15:57       ` Tvrtko Ursulin
2018-03-22 16:01         ` Jeff McGee
2018-03-22 17:41           ` Tvrtko Ursulin
2018-03-22 19:08             ` Jeff McGee
2018-03-22 19:59               ` Bloomfield, Jon
2018-03-23 13:20                 ` Joonas Lahtinen
2018-03-23 13:37                   ` Chris Wilson
