* [RFC 0/8] Force preemption
@ 2018-03-16 18:30 jeff.mcgee
  2018-03-16 18:30 ` [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed jeff.mcgee
                   ` (10 more replies)
  0 siblings, 11 replies; 33+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:30 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

Force preemption uses engine reset to enforce a limit on the time
that a request targeted for preemption can block that preemption.
This feature is a requirement in automotive systems where the GPU may
be shared by clients of critically high priority and low-priority
clients that may not have been curated to be preemption-friendly.
There may be more general applications of this feature. I'm sharing
this as an RFC to stimulate that discussion and also to get any
technical feedback that I can before submitting to the product kernel
that needs this. I have structured the patches for ease of rebasing,
given that this is for the moment considered a non-upstreamable
feature. It would be possible to refactor hangcheck to fully
incorporate force preemption as another tier of patience (or
impatience) with the running request.
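
In rough outline, the flow the series adds looks like this. This is a
simplified sketch using the helpers introduced in patch 8, not the
exact code:

    /* execlists_dequeue(): when the preempt-to-idle context is
     * injected, arm a timer bounded by the fpreempt_timeout modparam.
     */
    inject_preempt_context(engine);
    execlists_set_active(execlists, EXECLISTS_ACTIVE_PREEMPT);
    intel_engine_queue_fpreempt(engine);    /* hrtimer, in msecs */

    /* CSB handler: the preemption completed in time, disarm the timer. */
    intel_engine_cancel_fpreempt(engine);

    /* Timer expiry: queue work that resets only this engine, skipping
     * error capture (patch 7).
     */
    i915_handle_reset(engine->i915, intel_engine_flag(engine));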

Jeff McGee (8):
  drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed
  drm/i915: Skip CSB processing on invalid CSB tail
  drm/i915: Execlists to mark the HWSP upon preemption finished
  drm/i915: Add a wait_for routine with more exact timeout
  drm/i915: Consider preemption when finding the active request
  drm/i915: Repair the preemption context if hit by reset
  drm/i915: Allow reset without error capture
  drm/i915: Force preemption to complete via engine reset

 drivers/gpu/drm/i915/i915_drv.c         | 14 +++++-
 drivers/gpu/drm/i915/i915_drv.h         |  4 ++
 drivers/gpu/drm/i915/i915_gem.c         | 86 ++++++++++++++++++++++++++++++---
 drivers/gpu/drm/i915/i915_irq.c         | 75 ++++++++++++++++------------
 drivers/gpu/drm/i915/i915_params.c      |  3 ++
 drivers/gpu/drm/i915/i915_params.h      |  1 +
 drivers/gpu/drm/i915/intel_drv.h        | 22 +++++++++
 drivers/gpu/drm/i915/intel_engine_cs.c  | 53 ++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.c        | 81 +++++++++++++++++++++++++++++--
 drivers/gpu/drm/i915/intel_ringbuffer.h | 20 +++++++-
 10 files changed, 314 insertions(+), 45 deletions(-)

-- 
2.16.2


* [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
@ 2018-03-16 18:30 ` jeff.mcgee
  2018-03-16 20:30   ` Chris Wilson
  2018-03-16 18:30 ` [RFC 2/8] drm/i915: Skip CSB processing on invalid CSB tail jeff.mcgee
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:30 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

It is possible for the hardware to be reset just as a context is
completing. The reset post-processing may see the seqno update and
assume that the context escaped corruption, but the context may
have been disrupted in the save out process. The corruption may
screw up the HEAD and TAIL pointers such that the next submission
of the context switches out without running the intended request.
The GEM_BUG_ON will be hit in this situation, but it is not really
a driver error. So downgrade it to a GEM_WARN_ON so that we still
notice, while letting hangcheck detect the problem and clean up.

This patch is required to support the force preemption feature.

Test: Run IGT gem_exec_fpreempt repeatedly with
      CONFIG_DRM_I915_DEBUG_GEM.
Change-Id: I87da4b16bad805fe48153a9ed9169900681ebba7
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index b44861459d24..7d93fcd56d34 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -920,7 +920,7 @@ static void intel_lrc_irq_handler(unsigned long data)
 			GEM_BUG_ON(count == 0);
 			if (--count == 0) {
 				GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
-				GEM_BUG_ON(!i915_gem_request_completed(rq));
+				GEM_WARN_ON(!i915_gem_request_completed(rq));
 				execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_OUT);
 
 				trace_i915_gem_request_out(rq);
-- 
2.16.2


* [RFC 2/8] drm/i915: Skip CSB processing on invalid CSB tail
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
  2018-03-16 18:30 ` [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed jeff.mcgee
@ 2018-03-16 18:30 ` jeff.mcgee
  2018-03-16 20:30   ` Chris Wilson
  2018-03-16 18:31 ` [RFC 3/8] drm/i915: Execlists to mark the HWSP upon preemption finished jeff.mcgee
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:30 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

Engine reset is fast. A context switch interrupt may be generated just
prior to the reset such that the top half handler is racing with reset
post-processing. The handler may set the irq_posted bit again after
the reset code has cleared it to start fresh. Then the re-enabled
tasklet will read the CSB head and tail from MMIO, which will be at
the hardware reset values of 0 and 7 respectively, given that no
actual CSB event has occurred since the reset. Mayhem then ensues as
the tasklet starts processing invalid CSB entries.

We can handle this corner case without adding any new synchronization
between the irq top half and the reset work item. The tasklet can
just skip CSB processing if the tail is not sane, i.e. if the MMIO
read returns a value outside the valid range of CSB entries, which is
exactly what the stale post-reset value is.

This patch is required to support the force preemption feature.

Test: Run IGT gem_exec_fpreempt repeatedly.
Change-Id: Ic7eb00600480bd62331c397dd92b747d946241e4
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 7d93fcd56d34..5f63d1d6a2d6 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -851,6 +851,14 @@ static void intel_lrc_irq_handler(unsigned long data)
 			head = readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)));
 			tail = GEN8_CSB_WRITE_PTR(head);
 			head = GEN8_CSB_READ_PTR(head);
+
+			/* The MMIO read CSB tail may be at the reset value of
+			 * 0x7 if there hasn't been a valid CSB event since
+			 * the engine reset. Skip on to dequeue if so.
+			 */
+			if (tail >= GEN8_CSB_ENTRIES)
+				break;
+
 			execlists->csb_head = head;
 		} else {
 			const int write_idx =
@@ -859,6 +867,7 @@ static void intel_lrc_irq_handler(unsigned long data)
 
 			head = execlists->csb_head;
 			tail = READ_ONCE(buf[write_idx]);
+			GEM_BUG_ON(tail >= GEN8_CSB_ENTRIES);
 		}
 
 		while (head != tail) {
-- 
2.16.2


* [RFC 3/8] drm/i915: Execlists to mark the HWSP upon preemption finished
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
  2018-03-16 18:30 ` [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed jeff.mcgee
  2018-03-16 18:30 ` [RFC 2/8] drm/i915: Skip CSB processing on invalid CSB tail jeff.mcgee
@ 2018-03-16 18:31 ` jeff.mcgee
  2018-03-16 18:31 ` [RFC 4/8] drm/i915: Add a wait_for routine with more exact timeout jeff.mcgee
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

In the next patch we are improving how we find the active request
on an engine with a pending preemption. We need to know if the
preemption has completed or not. This determination can be made
most robustly by having the preemption context write a preemption
finished indicator to the hardware status page.
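
Later patches in the series poll this indicator from the reset and
active-request paths, roughly like this (sketch, using the helper
added below):

    /* Engine preempted to idle; no request is currently active. */
    if (intel_engine_preempt_finished(engine))
        return NULL;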

This patch is required to support the force preemption feature.

Change-Id: I4123283aec02e21d13c7cb55f329cf3f553b5d2c
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c        | 27 ++++++++++++++++++++++++---
 drivers/gpu/drm/i915/intel_ringbuffer.h | 14 +++++++++++++-
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 5f63d1d6a2d6..b2f838c484b0 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -512,19 +512,37 @@ static void port_assign(struct execlist_port *port,
 	port_set(port, port_pack(i915_gem_request_get(rq), port_count(port)));
 }
 
+#define PREEMPT_BREADCRUMB_DWORDS 0x8
 static void inject_preempt_context(struct intel_engine_cs *engine)
 {
 	struct intel_context *ce =
 		&engine->i915->preempt_context->engine[engine->id];
+	u32 *cs = ce->ring->vaddr + ce->ring->tail;
 	u32 __iomem *elsp =
 		engine->i915->regs + i915_mmio_reg_offset(RING_ELSP(engine));
 	unsigned int n;
 
 	GEM_BUG_ON(engine->i915->preempt_context->hw_id != PREEMPT_ID);
-	GEM_BUG_ON(!IS_ALIGNED(ce->ring->size, WA_TAIL_BYTES));
+	GEM_BUG_ON(intel_engine_preempt_finished(engine));
 
-	memset(ce->ring->vaddr + ce->ring->tail, 0, WA_TAIL_BYTES);
-	ce->ring->tail += WA_TAIL_BYTES;
+	if (engine->id == RCS) {
+		cs = gen8_emit_ggtt_write_rcs(cs, I915_GEM_HWS_PREEMPT_FINISHED,
+				intel_hws_preempt_done_address(engine));
+	} else {
+		cs = gen8_emit_ggtt_write(cs, I915_GEM_HWS_PREEMPT_FINISHED,
+				intel_hws_preempt_done_address(engine));
+		*cs++ = MI_NOOP;
+		*cs++ = MI_NOOP;
+	}
+	*cs++ = MI_NOOP;
+	*cs++ = MI_NOOP;
+
+	GEM_BUG_ON(!IS_ALIGNED(ce->ring->size,
+			       PREEMPT_BREADCRUMB_DWORDS * sizeof(u32)));
+	GEM_BUG_ON((void *)cs - (ce->ring->vaddr + ce->ring->tail) !=
+		   PREEMPT_BREADCRUMB_DWORDS * sizeof(u32));
+
+	ce->ring->tail += PREEMPT_BREADCRUMB_DWORDS * sizeof(u32);
 	ce->ring->tail &= (ce->ring->size - 1);
 	ce->lrc_reg_state[CTX_RING_TAIL+1] = ce->ring->tail;
 
@@ -911,6 +929,8 @@ static void intel_lrc_irq_handler(unsigned long data)
 								EXECLISTS_ACTIVE_PREEMPT));
 				execlists_clear_active(execlists,
 						       EXECLISTS_ACTIVE_PREEMPT);
+				GEM_BUG_ON(!intel_engine_preempt_finished(engine));
+				intel_engine_clear_preempt(engine);
 				continue;
 			}
 
@@ -1507,6 +1527,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
 
 	execlists->csb_head = -1;
 	execlists->active = 0;
+	intel_engine_clear_preempt(engine);
 
 	/* After a GPU reset, we may have requests to replay */
 	if (!i915_modparams.enable_guc_submission && execlists->first)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 5b9117b3cba4..25eb23bc06eb 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -625,11 +625,12 @@ intel_write_status_page(struct intel_engine_cs *engine, int reg, u32 value)
 #define I915_GEM_HWS_INDEX_ADDR (I915_GEM_HWS_INDEX << MI_STORE_DWORD_INDEX_SHIFT)
 #define I915_GEM_HWS_PREEMPT_INDEX	0x32
 #define I915_GEM_HWS_PREEMPT_ADDR (I915_GEM_HWS_PREEMPT_INDEX << MI_STORE_DWORD_INDEX_SHIFT)
+#define I915_GEM_HWS_PREEMPT_FINISHED	1
 #define I915_GEM_HWS_PID_INDEX 0x40
 #define I915_GEM_HWS_PID_ADDR (I915_GEM_HWS_PID_INDEX << MI_STORE_DWORD_INDEX_SHIFT)
 #define I915_GEM_HWS_CID_INDEX 0x48
 #define I915_GEM_HWS_CID_ADDR (I915_GEM_HWS_CID_INDEX << MI_STORE_DWORD_INDEX_SHIFT)
-#define I915_GEM_HWS_SCRATCH_INDEX     0x50
+#define I915_GEM_HWS_SCRATCH_INDEX	0x58
 #define I915_GEM_HWS_SCRATCH_ADDR (I915_GEM_HWS_SCRATCH_INDEX << MI_STORE_DWORD_INDEX_SHIFT)
 
 #define I915_HWS_CSB_BUF0_INDEX		0x10
@@ -749,6 +750,17 @@ static inline u32 intel_engine_get_seqno(struct intel_engine_cs *engine)
 	return intel_read_status_page(engine, I915_GEM_HWS_INDEX);
 }
 
+static inline bool intel_engine_preempt_finished(struct intel_engine_cs *engine)
+{
+	return (intel_read_status_page(engine, I915_GEM_HWS_PREEMPT_INDEX) ==
+		I915_GEM_HWS_PREEMPT_FINISHED);
+}
+
+static inline void intel_engine_clear_preempt(struct intel_engine_cs *engine)
+{
+	intel_write_status_page(engine, I915_GEM_HWS_PREEMPT_INDEX, 0);
+}
+
 static inline u32 intel_engine_last_submit(struct intel_engine_cs *engine)
 {
 	/* We are only peeking at the tail of the submit queue (and not the
-- 
2.16.2


* [RFC 4/8] drm/i915: Add a wait_for routine with more exact timeout
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (2 preceding siblings ...)
  2018-03-16 18:31 ` [RFC 3/8] drm/i915: Execlists to mark the HWSP upon preemption finished jeff.mcgee
@ 2018-03-16 18:31 ` jeff.mcgee
  2018-03-16 18:31 ` [RFC 5/8] drm/i915: Consider preemption when finding the active request jeff.mcgee
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

The current non-atomic wait_for routines convert the provided usec
timeout into jiffies with upward rounding. This can increase the
actual timeout by several msecs, which is fine in most cases. In the
next patch we need the timeout to conform more exactly to what is
requested.
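
For reference, the next patch uses the new helper roughly like this
(sketch):

    /* Poll for preempt-to-idle completion for at most 500 us, sleeping
     * 10-50 us between checks, without any jiffy rounding of the 500.
     */
    if (!_wait_for_exact(intel_engine_preempt_finished(engine), 500, 10, 50))
        return NULL;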

This patch is required to support the force preemption feature.

Change-Id: I20372ccd4db6609c42a6b41000ca7ff7700358fc
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_drv.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index fa800918f67d..7cb6619129c3 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -75,6 +75,28 @@
 						   (Wmax))
 #define wait_for(COND, MS)		_wait_for((COND), (MS) * 1000, 10, 1000)
 
+#define _wait_for_exact(COND, US, Wmin, Wmax) ({ \
+	u64 timeout__ = local_clock() + (US) * 1000;			\
+	long wait__ = (Wmin); /* recommended min for usleep is 10 us */	\
+	int ret__;							\
+	might_sleep();							\
+	for (;;) {							\
+		bool expired__ = (local_clock() > timeout__);		\
+		if (COND) {						\
+			ret__ = 0;					\
+			break;						\
+		}							\
+		if (expired__) {					\
+			ret__ = -ETIMEDOUT;				\
+			break;						\
+		}							\
+		usleep_range(wait__, wait__ * 2);			\
+		if (wait__ < (Wmax))					\
+			wait__ <<= 1;					\
+	}								\
+	ret__;								\
+})
+
 /* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
 #if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
 # define _WAIT_FOR_ATOMIC_CHECK(ATOMIC) WARN_ON_ONCE((ATOMIC) && !in_atomic())
-- 
2.16.2


* [RFC 5/8] drm/i915: Consider preemption when finding the active request
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (3 preceding siblings ...)
  2018-03-16 18:31 ` [RFC 4/8] drm/i915: Add a wait_for routine with more exact timeout jeff.mcgee
@ 2018-03-16 18:31 ` jeff.mcgee
  2018-03-16 18:31 ` [RFC 6/8] drm/i915: Repair the preemption context if hit by reset jeff.mcgee
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

The active request is found by scanning the engine timeline for the
request that follows the last completed request. That method is accurate
if there is no preemption in progress, because the engine will certainly
have started that request. If there is a preemption in progress, it could
have completed leaving the engine idle. In this case the request we
identified with the above method is not active and may not even have been
started. We must check for this condition to avoid fingering an innocent
request during reset.

This patch is required to support the force preemption feature.

Test: Run IGT gem_exec_fpreempt repeatedly.
Change-Id: I63a9f64446e24d4ee36b4af32854699bda006ddd
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 49 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 45 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index e2961e3913b8..9780d9026ce6 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2739,9 +2739,9 @@ static void i915_gem_context_mark_innocent(struct i915_gem_context *ctx)
 }
 
 struct drm_i915_gem_request *
-i915_gem_find_active_request(struct intel_engine_cs *engine)
+i915_gem_find_pending_request(struct intel_engine_cs *engine)
 {
-	struct drm_i915_gem_request *request, *active = NULL;
+	struct drm_i915_gem_request *request, *pending = NULL;
 	unsigned long flags;
 
 	/* We are called by the error capture and reset at a random
@@ -2762,12 +2762,53 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
 		GEM_BUG_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
 				    &request->fence.flags));
 
-		active = request;
+		pending = request;
 		break;
 	}
 	spin_unlock_irqrestore(&engine->timeline->lock, flags);
 
-	return active;
+	return pending;
+}
+
+struct drm_i915_gem_request *
+i915_gem_find_active_request(struct intel_engine_cs *engine)
+{
+	struct intel_engine_execlists * const execlists = &engine->execlists;
+	struct drm_i915_gem_request *request;
+
+	/* The pending request is active if no preemption is in progress */
+	if (!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT))
+		return i915_gem_find_pending_request(engine);
+
+	/* Preemption has finished. Engine idle. */
+	if (intel_engine_preempt_finished(engine))
+		return NULL;
+
+	request = i915_gem_find_pending_request(engine);
+
+	/* Preemption has flushed all requests off the engine. Engine idle. */
+	if (!request)
+		return NULL;
+
+	/* The pending request likely is active and blocking the in-progress
+	 * preemption. But there is a race in our previous checks. The request
+	 * that was actually blocking preemption could have completed (a batch-
+	 * boundary preemption) such that the engine is idle and the pending
+	 * request we have identified was the next in line. We must wait for
+	 * at least as long as it would take for the preempt-to-idle context
+	 * to mark the preemption done to verify this. We use 500 usecs to
+	 * account for a worst case delay from the seqno write of the
+	 * completing request and the preempt finished write.
+	 */
+	if (!_wait_for_exact(intel_engine_preempt_finished(engine), 500, 10, 50))
+		return NULL;
+
+	/*
+	 * We didn't see the preemption done after a sufficient wait. Thus the
+	 * pending request we sampled above was in fact active and blocking
+	 * the preemption.
+	 */
+	return request;
 }
 
 static bool engine_stalled(struct intel_engine_cs *engine)
-- 
2.16.2


* [RFC 6/8] drm/i915: Repair the preemption context if hit by reset
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (4 preceding siblings ...)
  2018-03-16 18:31 ` [RFC 5/8] drm/i915: Consider preemption when finding the active request jeff.mcgee
@ 2018-03-16 18:31 ` jeff.mcgee
  2018-03-16 18:31 ` [RFC 7/8] drm/i915: Allow reset without error capture jeff.mcgee
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

It is possible for the preemption context to be active on an engine
when the engine is reset. The preemption context is not tracked via
the request mechanism, so the normal reset handling will not detect
this scenario and will not be able to fix up possible context
corruption. So add some extra logic to handle it.

This patch is required to support the force preemption feature.

Test: Run IGT gem_exec_fpreempt repeatedly.
Change-Id: Ifd0f17726111f3b702dd900064b7f375bbb42808
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index b2f838c484b0..581483886153 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1619,6 +1619,25 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 
 	spin_unlock_irqrestore(&engine->timeline->lock, flags);
 
+	/* If a preemption was pending when the reset occurred, and no
+	 * active request was found when the reset completed, it is
+	 * possible that the preemption context was hit by the reset.
+	 * We must assume that the context is corrupted so repair it.
+	 */
+	if (execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT) &&
+	    !request) {
+		struct i915_gem_context *ctx = engine->i915->preempt_context;
+		ce = &ctx->engine[engine->id];
+
+		execlists_init_reg_state(ce->lrc_reg_state,
+					 ctx, engine, ce->ring);
+		ce->lrc_reg_state[CTX_RING_BUFFER_START+1] =
+			i915_ggtt_offset(ce->ring->vma);
+		ce->lrc_reg_state[CTX_RING_HEAD+1] = ce->ring->tail;
+
+		return;
+	}
+
 	/* If the request was innocent, we leave the request in the ELSP
 	 * and will try to replay it on restarting. The context image may
 	 * have been corrupted by the reset, in which case we may have
-- 
2.16.2


* [RFC 7/8] drm/i915: Allow reset without error capture
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (5 preceding siblings ...)
  2018-03-16 18:31 ` [RFC 6/8] drm/i915: Repair the preemption context if hit by reset jeff.mcgee
@ 2018-03-16 18:31 ` jeff.mcgee
  2018-03-16 20:33   ` Chris Wilson
  2018-03-16 18:31 ` [RFC 8/8] drm/i915: Force preemption to complete via engine reset jeff.mcgee
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

Pull the reset handling out of i915_handle_error() so that it can be
called by that function and directly by the upcoming force preemption
handler. This allows the force preemption handler to bypass the error
capture that i915_handle_error() does before getting on with the
reset. We do not want error capture for force preemption because it
adds significant latency (~10 msecs measured on APL).
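
For context, the force preemption handler added in patch 8 becomes
the direct caller of the new entry point (sketch of that worker):

    static void intel_engine_fpreempt_work(struct work_struct *work)
    {
    	struct intel_engine_cs *engine =
    		container_of(work, struct intel_engine_cs, fpreempt_work);

    	/* Reset just this engine, with no error state capture. */
    	i915_handle_reset(engine->i915, intel_engine_flag(engine));
    }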

This patch is required to support the force preemption feature.

Change-Id: I41b4fae1adc197f0e70cec47cb960a0d7fa55f48
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h |  2 ++
 drivers/gpu/drm/i915/i915_irq.c | 75 +++++++++++++++++++++++------------------
 2 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d8524357373e..ade09f97be5c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3245,6 +3245,8 @@ __printf(3, 4)
 void i915_handle_error(struct drm_i915_private *dev_priv,
 		       u32 engine_mask,
 		       const char *fmt, ...);
+void i915_handle_reset(struct drm_i915_private *dev_priv,
+		       u32 engine_mask);
 
 extern void intel_irq_init(struct drm_i915_private *dev_priv);
 extern void intel_irq_fini(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index a34f459f8ac1..ab5d4d40083d 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2673,41 +2673,17 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
 }
 
 /**
- * i915_handle_error - handle a gpu error
+ * i915_handle_reset - handle a gpu reset
  * @dev_priv: i915 device private
- * @engine_mask: mask representing engines that are hung
- * @fmt: Error message format string
+ * @engine_mask: mask representing engines that require reset
  *
- * Do some basic checking of register state at error time and
- * dump it to the syslog.  Also call i915_capture_error_state() to make
- * sure we get a record and make it available in debugfs.  Fire a uevent
- * so userspace knows something bad happened (should trigger collection
- * of a ring dump etc.).
+ * Executes reset on the given engines.
  */
-void i915_handle_error(struct drm_i915_private *dev_priv,
-		       u32 engine_mask,
-		       const char *fmt, ...)
+void i915_handle_reset(struct drm_i915_private *dev_priv,
+		       u32 engine_mask)
 {
 	struct intel_engine_cs *engine;
 	unsigned int tmp;
-	va_list args;
-	char error_msg[80];
-
-	va_start(args, fmt);
-	vscnprintf(error_msg, sizeof(error_msg), fmt, args);
-	va_end(args);
-
-	/*
-	 * In most cases it's guaranteed that we get here with an RPM
-	 * reference held, for example because there is a pending GPU
-	 * request that won't finish until the reset is done. This
-	 * isn't the case at least when we get here by doing a
-	 * simulated reset via debugfs, so get an RPM reference.
-	 */
-	intel_runtime_pm_get(dev_priv);
-
-	i915_capture_error_state(dev_priv, engine_mask, error_msg);
-	i915_clear_error_registers(dev_priv);
 
 	/*
 	 * Try engine reset when available. We fall back to full reset if
@@ -2731,14 +2707,14 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	}
 
 	if (!engine_mask)
-		goto out;
+		return;
 
 	/* Full reset needs the mutex, stop any other user trying to do so. */
 	if (test_and_set_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags)) {
 		wait_event(dev_priv->gpu_error.reset_queue,
 			   !test_bit(I915_RESET_BACKOFF,
 				     &dev_priv->gpu_error.flags));
-		goto out;
+		return;
 	}
 
 	/* Prevent any other reset-engine attempt. */
@@ -2759,8 +2735,43 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 
 	clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
 	wake_up_all(&dev_priv->gpu_error.reset_queue);
+}
+/**
+ * i915_handle_error - handle a gpu error
+ * @dev_priv: i915 device private
+ * @engine_mask: mask representing engines that are hung
+ * @fmt: Error message format string
+ *
+ * Do some basic checking of register state at error time and
+ * dump it to the syslog.  Also call i915_capture_error_state() to make
+ * sure we get a record and make it available in debugfs.  Fire a uevent
+ * so userspace knows something bad happened (should trigger collection
+ * of a ring dump etc.).
+ */
+void i915_handle_error(struct drm_i915_private *dev_priv,
+		       u32 engine_mask,
+		       const char *fmt, ...)
+{
+	va_list args;
+	char error_msg[80];
+
+	va_start(args, fmt);
+	vscnprintf(error_msg, sizeof(error_msg), fmt, args);
+	va_end(args);
+
+	/*
+	 * In most cases it's guaranteed that we get here with an RPM
+	 * reference held, for example because there is a pending GPU
+	 * request that won't finish until the reset is done. This
+	 * isn't the case at least when we get here by doing a
+	 * simulated reset via debugfs, so get an RPM reference.
+	 */
+	intel_runtime_pm_get(dev_priv);
+
+	i915_capture_error_state(dev_priv, engine_mask, error_msg);
+	i915_clear_error_registers(dev_priv);
+	i915_handle_reset(dev_priv, engine_mask);
 
-out:
 	intel_runtime_pm_put(dev_priv);
 }
 
-- 
2.16.2


* [RFC 8/8] drm/i915: Force preemption to complete via engine reset
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (6 preceding siblings ...)
  2018-03-16 18:31 ` [RFC 7/8] drm/i915: Allow reset without error capture jeff.mcgee
@ 2018-03-16 18:31 ` jeff.mcgee
  2018-03-16 20:39   ` Chris Wilson
  2018-03-16 20:44   ` Chris Wilson
  2018-03-16 18:59 ` ✗ Fi.CI.BAT: failure for Force preemption Patchwork
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 33+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

The hardware can complete the requested preemption only at certain points
in execution. Thus an uncooperative request that avoids those points can
block a preemption indefinitely. Our only option to bound the preemption
latency is to trigger reset and recovery just as we would if a request had
hung the hardware. This is so-called force preemption. This change adds
that capability as an option for systems with extremely strict scheduling
latency requirements for their high-priority requests. This option must be
used with great care. The force-preempted requests will be discarded at
the point of reset, resulting in various degrees of disruption to the
owning application, up to and including a crash.

The option is enabled by specifying a non-zero value for the new i915
module parameter fpreempt_timeout. This value is the time in milliseconds
after initiation of a preemption at which the reset is triggered if the
preemption has not completed normally.
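
For example (illustrative value), booting with i915.fpreempt_timeout=100
on the kernel command line bounds any single preemption to roughly
100 msecs before the target engine is reset; the default of 0 leaves
force preemption disabled.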

Test: Run IGT gem_exec_fpreempt.
Change-Id: Iafd3609012621c57fa9e490dfeeac46ae541b5c2
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c         | 14 ++++++++-
 drivers/gpu/drm/i915/i915_drv.h         |  2 ++
 drivers/gpu/drm/i915/i915_gem.c         | 37 +++++++++++++++++++++--
 drivers/gpu/drm/i915/i915_params.c      |  3 ++
 drivers/gpu/drm/i915/i915_params.h      |  1 +
 drivers/gpu/drm/i915/intel_engine_cs.c  | 53 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.c        | 24 ++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  6 ++++
 8 files changed, 136 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index b627370d5a9c..3f2394d61ea2 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -811,8 +811,16 @@ static int i915_workqueues_init(struct drm_i915_private *dev_priv)
 	if (dev_priv->hotplug.dp_wq == NULL)
 		goto out_free_wq;
 
+	if (INTEL_INFO(dev_priv)->has_logical_ring_preemption) {
+		dev_priv->fpreempt_wq = alloc_ordered_workqueue("i915-fpreempt",
+								WQ_HIGHPRI);
+		if (dev_priv->fpreempt_wq == NULL)
+			goto out_free_dp_wq;
+	}
 	return 0;
 
+out_free_dp_wq:
+	destroy_workqueue(dev_priv->hotplug.dp_wq);
 out_free_wq:
 	destroy_workqueue(dev_priv->wq);
 out_err:
@@ -832,6 +840,8 @@ static void i915_engines_cleanup(struct drm_i915_private *i915)
 
 static void i915_workqueues_cleanup(struct drm_i915_private *dev_priv)
 {
+	if (INTEL_INFO(dev_priv)->has_logical_ring_preemption)
+		destroy_workqueue(dev_priv->fpreempt_wq);
 	destroy_workqueue(dev_priv->hotplug.dp_wq);
 	destroy_workqueue(dev_priv->wq);
 }
@@ -2007,7 +2017,9 @@ int i915_reset_engine(struct intel_engine_cs *engine, unsigned int flags)
 
 	if (!(flags & I915_RESET_QUIET)) {
 		dev_notice(engine->i915->drm.dev,
-			   "Resetting %s after gpu hang\n", engine->name);
+			   "Resetting %s %s\n", engine->name,
+			   engine->fpreempt_active ?
+			   "for force preemption" : "after gpu hang");
 	}
 	error->reset_engine_count[engine->id]++;
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ade09f97be5c..514e640d8406 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2290,6 +2290,8 @@ struct drm_i915_private {
 	 */
 	struct workqueue_struct *wq;
 
+	struct workqueue_struct *fpreempt_wq;
+
 	/* Display functions */
 	struct drm_i915_display_funcs display;
 
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9780d9026ce6..d556743c578a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2811,9 +2811,21 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
 	return request;
 }
 
-static bool engine_stalled(struct intel_engine_cs *engine)
+static bool engine_stalled(struct intel_engine_cs *engine,
+			   struct drm_i915_gem_request *request)
 {
 	if (!intel_vgpu_active(engine->i915)) {
+		if (engine->fpreempt_active) {
+			/* Pardon the request if it managed to complete or
+			 * preempt prior to the reset.
+			 */
+			if (i915_gem_request_completed(request) ||
+			    intel_engine_preempt_finished(engine))
+				return false;
+
+			return true;
+		}
+
 		if (!engine->hangcheck.stalled)
 			return false;
 
@@ -2858,6 +2870,13 @@ i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 	tasklet_kill(&engine->execlists.irq_tasklet);
 	tasklet_disable(&engine->execlists.irq_tasklet);
 
+	/* There may be a force preemption timer active on this engine but
+	 * not yet expired, i.e. not the reason we are about to reset this
+	 * engine. Cancel it. If force preemption timeout is the reason we
+	 * are resetting the engine, this call will have no effect.
+	 */
+	intel_engine_cancel_fpreempt(engine);
+
 	if (engine->irq_seqno_barrier)
 		engine->irq_seqno_barrier(engine);
 
@@ -2958,7 +2977,7 @@ i915_gem_reset_request(struct intel_engine_cs *engine,
 	 * subsequent hangs.
 	 */
 
-	if (engine_stalled(engine)) {
+	if (engine_stalled(engine, request)) {
 		i915_gem_context_mark_guilty(request->ctx);
 		skip_request(request);
 
@@ -2966,6 +2985,13 @@ i915_gem_reset_request(struct intel_engine_cs *engine,
 		if (i915_gem_context_is_banned(request->ctx))
 			engine_skip_context(request);
 	} else {
+		/* If the request that we just pardoned was the target of a
+		 * force preemption there is no possibility of the next
+		 * request in line having started.
+		 */
+		if (engine->fpreempt_active)
+			return NULL;
+
 		/*
 		 * Since this is not the hung engine, it may have advanced
 		 * since the hang declaration. Double check by refinding
@@ -3035,6 +3061,13 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
 
 void i915_gem_reset_finish_engine(struct intel_engine_cs *engine)
 {
+	/* Mark any active force preemption as complete then kick
+	 * the tasklet.
+	 */
+	engine->fpreempt_active = false;
+	if (engine->execlists.first)
+		tasklet_schedule(&engine->execlists.irq_tasklet);
+
 	tasklet_enable(&engine->execlists.irq_tasklet);
 	kthread_unpark(engine->breadcrumbs.signaler);
 }
diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 63751a1de74a..9a0deb6e3920 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -247,3 +247,6 @@ MODULE_PARM_DESC(enable_conformance_check, "To toggle the GVT guest conformance
 
 module_param_named(disable_gvt_fw_loading, i915_modparams.disable_gvt_fw_loading, bool, 0400);
 MODULE_PARM_DESC(disable_gvt_fw_loading, "Disable GVT-g fw loading.");
+
+i915_param_named(fpreempt_timeout, uint, 0600,
+	"Wait time in msecs before forcing a preemption with reset (0:never force [default])");
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index e8c2ba4cb1e6..65fbffa6333c 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -54,6 +54,7 @@
 	func(int, edp_vswing); \
 	func(int, reset); \
 	func(unsigned int, inject_load_failure); \
+	func(unsigned int, fpreempt_timeout); \
 	/* leave bools at the end to not create holes */ \
 	func(bool, alpha_support); \
 	func(bool, enable_cmd_parser); \
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index c9bf2347f7a4..af6cc2d0f7e9 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -411,6 +411,58 @@ static void intel_engine_init_execlist(struct intel_engine_cs *engine)
 	execlists->first = NULL;
 }
 
+void intel_engine_queue_fpreempt(struct intel_engine_cs *engine)
+{
+	unsigned int timeout = i915_modparams.fpreempt_timeout;
+
+	if (!timeout)
+		return;
+
+	GEM_BUG_ON(engine->fpreempt_active);
+	hrtimer_start(&engine->fpreempt_timer,
+		      ms_to_ktime(timeout), HRTIMER_MODE_REL);
+}
+
+bool intel_engine_cancel_fpreempt(struct intel_engine_cs *engine)
+{
+	hrtimer_cancel(&engine->fpreempt_timer);
+
+	return !engine->fpreempt_active;
+}
+
+static enum hrtimer_restart
+intel_engine_fpreempt_timer(struct hrtimer *hrtimer)
+{
+	struct intel_engine_cs *engine =
+		container_of(hrtimer, struct intel_engine_cs,
+			     fpreempt_timer);
+
+	engine->fpreempt_active = true;
+	queue_work(engine->i915->fpreempt_wq, &engine->fpreempt_work);
+
+	return HRTIMER_NORESTART;
+}
+
+static void intel_engine_fpreempt_work(struct work_struct *work)
+{
+	struct intel_engine_cs *engine =
+		container_of(work, struct intel_engine_cs,
+			     fpreempt_work);
+
+	i915_handle_reset(engine->i915, intel_engine_flag(engine));
+}
+
+static void intel_engine_init_fpreempt(struct intel_engine_cs *engine)
+{
+	if (!INTEL_INFO(engine->i915)->has_logical_ring_preemption)
+		return;
+
+	hrtimer_init(&engine->fpreempt_timer,
+		     CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	engine->fpreempt_timer.function = intel_engine_fpreempt_timer;
+	INIT_WORK(&engine->fpreempt_work, intel_engine_fpreempt_work);
+}
+
 /**
  * intel_engines_setup_common - setup engine state not requiring hw access
  * @engine: Engine to setup.
@@ -426,6 +478,7 @@ void intel_engine_setup_common(struct intel_engine_cs *engine)
 
 	intel_engine_init_timeline(engine);
 	intel_engine_init_hangcheck(engine);
+	intel_engine_init_fpreempt(engine);
 	i915_gem_batch_pool_init(engine, &engine->batch_pool);
 
 	intel_engine_init_cmd_parser(engine);
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 581483886153..17487f8e8b4c 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -628,6 +628,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			inject_preempt_context(engine);
 			execlists_set_active(execlists,
 					     EXECLISTS_ACTIVE_PREEMPT);
+			intel_engine_queue_fpreempt(engine);
 			goto unlock;
 		} else {
 			/*
@@ -846,6 +847,7 @@ static void intel_lrc_irq_handler(unsigned long data)
 		const u32 *buf =
 			&engine->status_page.page_addr[I915_HWS_CSB_BUF0_INDEX];
 		unsigned int head, tail;
+		bool defer = false;
 
 		/* However GVT emulation depends upon intercepting CSB mmio */
 		if (unlikely(execlists->csb_use_mmio)) {
@@ -919,6 +921,21 @@ static void intel_lrc_irq_handler(unsigned long data)
 
 			if (status & GEN8_CTX_STATUS_ACTIVE_IDLE &&
 			    buf[2*head + 1] == PREEMPT_ID) {
+				/* Try to cancel any pending force preemption.
+				 * If we are too late, hold off on processing
+				 * the completed preemption until reset has
+				 * run its course. It should recognize that
+				 * the engine has preempted to idle then abort
+				 * the reset. Then we can resume processing
+				 * at this CSB head.
+				 */
+				if (!intel_engine_cancel_fpreempt(engine)) {
+					if (!head--)
+						head = GEN8_CSB_ENTRIES - 1;
+					defer = true;
+					break;
+				}
+
 				execlist_cancel_port_requests(execlists);
 
 				spin_lock_irq(&engine->timeline->lock);
@@ -973,11 +990,16 @@ static void intel_lrc_irq_handler(unsigned long data)
 			writel(_MASKED_FIELD(GEN8_CSB_READ_PTR_MASK, head << 8),
 			       dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)));
 		}
+
+		if (defer) {
+			__set_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
+			goto out;
+		}
 	}
 
 	if (!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT))
 		execlists_dequeue(engine);
-
+out:
 	intel_uncore_forcewake_put(dev_priv, execlists->fw_domains);
 }
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 25eb23bc06eb..e259f954cdea 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -506,6 +506,10 @@ struct intel_engine_cs {
 
 	struct intel_engine_hangcheck hangcheck;
 
+	struct hrtimer fpreempt_timer;
+	struct work_struct fpreempt_work;
+	bool fpreempt_active;
+
 	bool needs_cmd_parser;
 
 	/*
@@ -932,4 +936,6 @@ void intel_engines_reset_default_submission(struct drm_i915_private *i915);
 
 bool intel_engine_can_store_dword(struct intel_engine_cs *engine);
 
+void intel_engine_queue_fpreempt(struct intel_engine_cs *engine);
+bool intel_engine_cancel_fpreempt(struct intel_engine_cs *engine);
 #endif /* _INTEL_RINGBUFFER_H_ */
-- 
2.16.2


* ✗ Fi.CI.BAT: failure for Force preemption
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (7 preceding siblings ...)
  2018-03-16 18:31 ` [RFC 8/8] drm/i915: Force preemption to complete via engine reset jeff.mcgee
@ 2018-03-16 18:59 ` Patchwork
  2018-03-16 20:53 ` [RFC 0/8] " Chris Wilson
  2018-03-16 22:34 ` [RFC 0/8] Force preemption Chris Wilson
  10 siblings, 0 replies; 33+ messages in thread
From: Patchwork @ 2018-03-16 18:59 UTC (permalink / raw)
  To: jeff.mcgee; +Cc: intel-gfx

== Series Details ==

Series: Force preemption
URL   : https://patchwork.freedesktop.org/series/40120/
State : failure

== Summary ==

Applying: drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed
error: sha1 information is lacking or useless (drivers/gpu/drm/i915/intel_lrc.c).
error: could not build fake ancestor
Patch failed at 0001 drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed
The copy of the patch that failed is found in: .git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


* Re: [RFC 2/8] drm/i915: Skip CSB processing on invalid CSB tail
  2018-03-16 18:30 ` [RFC 2/8] drm/i915: Skip CSB processing on invalid CSB tail jeff.mcgee
@ 2018-03-16 20:30   ` Chris Wilson
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2018-03-16 20:30 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:30:59)
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> Engine reset is fast. A context switch interrupt may be generated just
> prior to the reset such that the top half handler is racing with reset
> post-processing. The handler may set the irq_posted bit again after
> the reset code has cleared it to start fresh. Then the re-enabled
> tasklet will read the CSB head and tail from MMIO, which will be at
> the hardware reset values of 0 and 7 respectively, given that no
> actual CSB event has occurred since the reset. Mayhem then ensues as
> the tasklet starts processing invalid CSB entries.
> 
> We can handle this corner case without adding any new synchronization
> between the irq top half and the reset work item. The tasklet can
> just skip CSB processing if the tail is not sane.
> 
> This patch is required to support the force preemption feature.

Please note how this is handled currently, because this is not a new
problem.
-Chris

* Re: [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed
  2018-03-16 18:30 ` [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed jeff.mcgee
@ 2018-03-16 20:30   ` Chris Wilson
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2018-03-16 20:30 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:30:58)
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> It is possible for the hardware to be reset just as a context is
> completing. The reset post-processing may see the seqno update and
> assume that the context escaped corruption, but the context may
> have been disrupted in the save out process. The corruption may
> screw up the HEAD and TAIL pointers such that the next submission
> of the context switches out without running the intended request.
> The GEM_BUG_ON will be hit in this situation, but it is not really
> a driver error. So make it a GEM_WARN_ON so that we notice while
> letting hangcheck detect and clean up.
> 
> This patch is required to support the force preemption feature.
> 
> Test: Run IGT gem_exec_fpreempt repeatedly with
>       CONFIG_DRM_I915_DEBUG_GEM.
> Change-Id: I87da4b16bad805fe48153a9ed9169900681ebba7
> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index b44861459d24..7d93fcd56d34 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -920,7 +920,7 @@ static void intel_lrc_irq_handler(unsigned long data)
>                         GEM_BUG_ON(count == 0);
>                         if (--count == 0) {
>                                 GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
> -                               GEM_BUG_ON(!i915_gem_request_completed(rq));
> +                               GEM_WARN_ON(!i915_gem_request_completed(rq));

Nope, either way you broke it pretty badly.
-Chris

* Re: [RFC 7/8] drm/i915: Allow reset without error capture
  2018-03-16 18:31 ` [RFC 7/8] drm/i915: Allow reset without error capture jeff.mcgee
@ 2018-03-16 20:33   ` Chris Wilson
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2018-03-16 20:33 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:31:04)
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> Pull the reset handling out of i915_handle_error() so that it can be
> called by that function and directly by the upcoming force preemption
> handler. This allows the force preemption handler to bypass the error
> capture that i915_handle_error() does before getting on with the
> reset. We do not want error capture for force preemption because it
> adds significant latency (~10 msecs measured on APL).

And we don't need to capture the error in many other cases; just add a
control flag.
-Chris

* Re: [RFC 8/8] drm/i915: Force preemption to complete via engine reset
  2018-03-16 18:31 ` [RFC 8/8] drm/i915: Force preemption to complete via engine reset jeff.mcgee
@ 2018-03-16 20:39   ` Chris Wilson
  2018-03-16 20:44   ` Chris Wilson
  1 sibling, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2018-03-16 20:39 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:31:05)
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> The hardware can complete the requested preemption at only certain points
> in execution. Thus an uncooperative request that avoids those points can
> block a preemption indefinitely. Our only option to bound the preemption
> latency is to trigger reset and recovery just as we would if a request had
> hung the hardware. This is so-called forced preemption. This change adds
> that capability as an option for systems with extremely strict scheduling
> latency requirements for its high priority requests. This option must be
> used with great care. The force-preempted requests will be discarded at
> the point of reset, resulting in various degrees of disruption to the
> owning application up to and including crash.
> 
> The option is enabled by specifying a non-zero value for new i915 module
> parameter fpreempt_timeout. This value becomes the time in milliseconds
> after initiation of preemption at which the reset is triggered if the
> preemption has not completed normally.
> 
> Test: Run IGT gem_exec_fpreempt.
> Change-Id: Iafd3609012621c57fa9e490dfeeac46ae541b5c2
> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c         | 14 ++++++++-
>  drivers/gpu/drm/i915/i915_drv.h         |  2 ++
>  drivers/gpu/drm/i915/i915_gem.c         | 37 +++++++++++++++++++++--
>  drivers/gpu/drm/i915/i915_params.c      |  3 ++
>  drivers/gpu/drm/i915/i915_params.h      |  1 +
>  drivers/gpu/drm/i915/intel_engine_cs.c  | 53 +++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/intel_lrc.c        | 24 ++++++++++++++-
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  6 ++++
>  8 files changed, 136 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index b627370d5a9c..3f2394d61ea2 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -811,8 +811,16 @@ static int i915_workqueues_init(struct drm_i915_private *dev_priv)
>         if (dev_priv->hotplug.dp_wq == NULL)
>                 goto out_free_wq;
>  
> +       if (INTEL_INFO(dev_priv)->has_logical_ring_preemption) {
> +               dev_priv->fpreempt_wq = alloc_ordered_workqueue("i915-fpreempt",
> +                                                               WQ_HIGHPRI);
> +               if (dev_priv->fpreempt_wq == NULL)
> +                       goto out_free_dp_wq;

Doesn't require ordering; the resets, if either requires it, fall back
to a full reset.

> +       }
>         return 0;
>  
> +out_free_dp_wq:
> +       destroy_workqueue(dev_priv->hotplug.dp_wq);
>  out_free_wq:
>         destroy_workqueue(dev_priv->wq);
>  out_err:
> @@ -832,6 +840,8 @@ static void i915_engines_cleanup(struct drm_i915_private *i915)
>  
>  static void i915_workqueues_cleanup(struct drm_i915_private *dev_priv)
>  {
> +       if (INTEL_INFO(dev_priv)->has_logical_ring_preemption)
> +               destroy_workqueue(dev_priv->fpreempt_wq);
>         destroy_workqueue(dev_priv->hotplug.dp_wq);
>         destroy_workqueue(dev_priv->wq);
>  }
> @@ -2007,7 +2017,9 @@ int i915_reset_engine(struct intel_engine_cs *engine, unsigned int flags)
>  
>         if (!(flags & I915_RESET_QUIET)) {
>                 dev_notice(engine->i915->drm.dev,
> -                          "Resetting %s after gpu hang\n", engine->name);
> +                          "Resetting %s %s\n", engine->name,
> +                          engine->fpreempt_active ?
> +                          "for force preemption" : "after gpu hang");
>         }
>         error->reset_engine_count[engine->id]++;
>  
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index ade09f97be5c..514e640d8406 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2290,6 +2290,8 @@ struct drm_i915_private {
>          */
>         struct workqueue_struct *wq;
>  
> +       struct workqueue_struct *fpreempt_wq;
> +
>         /* Display functions */
>         struct drm_i915_display_funcs display;
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 9780d9026ce6..d556743c578a 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2811,9 +2811,21 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
>         return request;
>  }
>  
> -static bool engine_stalled(struct intel_engine_cs *engine)
> +static bool engine_stalled(struct intel_engine_cs *engine,
> +                          struct drm_i915_gem_request *request)
>  {
>         if (!intel_vgpu_active(engine->i915)) {
> +               if (engine->fpreempt_active) {
> +                       /* Pardon the request if it managed to complete or
> +                        * preempt prior to the reset.
> +                        */
> +                       if (i915_gem_request_completed(request) ||
> +                           intel_engine_preempt_finished(engine))
> +                               return false;
> +

No, once you pull the trigger just go through with it. You should never
pull the trigger if you are worried about the consequences.

> +                       return true;
> +               }
> +
>                 if (!engine->hangcheck.stalled)
>                         return false;
>  
> @@ -2858,6 +2870,13 @@ i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>         tasklet_kill(&engine->execlists.irq_tasklet);
>         tasklet_disable(&engine->execlists.irq_tasklet);
>  
> +       /* There may be a force preemption timer active on this engine but
> +        * not yet expired, i.e. not the reason we are about to reset this
> +        * engine. Cancel it. If force preemption timeout is the reason we
> +        * are resetting the engine, this call will have no efffect.
> +        */

Done by just clearing execlists->active.

> +       intel_engine_cancel_fpreempt(engine);
> +
>         if (engine->irq_seqno_barrier)
>                 engine->irq_seqno_barrier(engine);
>  
> @@ -2958,7 +2977,7 @@ i915_gem_reset_request(struct intel_engine_cs *engine,
>          * subsequent hangs.
>          */
>  
> -       if (engine_stalled(engine)) {
> +       if (engine_stalled(engine, request)) {
>                 i915_gem_context_mark_guilty(request->ctx);
>                 skip_request(request);
>  
> @@ -2966,6 +2985,13 @@ i915_gem_reset_request(struct intel_engine_cs *engine,
>                 if (i915_gem_context_is_banned(request->ctx))
>                         engine_skip_context(request);
>         } else {
> +               /* If the request that we just pardoned was the target of a
> +                * force preemption there is no possibility of the next
> +                * request in line having started.
> +                */
> +               if (engine->fpreempt_active)
> +                       return NULL;
> +
>                 /*
>                  * Since this is not the hung engine, it may have advanced
>                  * since the hang declaration. Double check by refinding
> @@ -3035,6 +3061,13 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
>  
>  void i915_gem_reset_finish_engine(struct intel_engine_cs *engine)
>  {
> +       /* Mark any active force preemption as complete then kick
> +        * the tasklet.
> +        */
> +       engine->fpreempt_active = false;
> +       if (engine->execlists.first)
> +               tasklet_schedule(&engine->execlists.irq_tasklet);

Stray chunk.

> +
>         tasklet_enable(&engine->execlists.irq_tasklet);
>         kthread_unpark(engine->breadcrumbs.signaler);
>  }
> diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
> index 63751a1de74a..9a0deb6e3920 100644
> --- a/drivers/gpu/drm/i915/i915_params.c
> +++ b/drivers/gpu/drm/i915/i915_params.c
> @@ -247,3 +247,6 @@ MODULE_PARM_DESC(enable_conformance_check, "To toggle the GVT guest conformance
>  
>  module_param_named(disable_gvt_fw_loading, i915_modparams.disable_gvt_fw_loading, bool, 0400);
>  MODULE_PARM_DESC(disable_gvt_fw_loading, "Disable GVT-g fw loading.");
> +
> +i915_param_named(fpreempt_timeout, uint, 0600,
> +       "Wait time in msecs before forcing a preemption with reset (0:never force [default])");
> diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
> index e8c2ba4cb1e6..65fbffa6333c 100644
> --- a/drivers/gpu/drm/i915/i915_params.h
> +++ b/drivers/gpu/drm/i915/i915_params.h
> @@ -54,6 +54,7 @@
>         func(int, edp_vswing); \
>         func(int, reset); \
>         func(unsigned int, inject_load_failure); \
> +       func(unsigned int, fpreempt_timeout); \
>         /* leave bools at the end to not create holes */ \
>         func(bool, alpha_support); \
>         func(bool, enable_cmd_parser); \
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index c9bf2347f7a4..af6cc2d0f7e9 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -411,6 +411,58 @@ static void intel_engine_init_execlist(struct intel_engine_cs *engine)
>         execlists->first = NULL;
>  }
>  
> +void intel_engine_queue_fpreempt(struct intel_engine_cs *engine)
> +{
> +       unsigned int timeout = i915_modparams.fpreempt_timeout;
> +
> +       if (!timeout)
> +               return;
> +
> +       GEM_BUG_ON(engine->fpreempt_active);
> +       hrtimer_start(&engine->fpreempt_timer,
> +                     ms_to_ktime(timeout), HRTIMER_MODE_REL);
> +}
> +
> +bool intel_engine_cancel_fpreempt(struct intel_engine_cs *engine)
> +{
> +       hrtimer_cancel(&engine->fpreempt_timer);
> +
> +       return !engine->fpreempt_active;
> +}
> +
> +static enum hrtimer_restart
> +intel_engine_fpreempt_timer(struct hrtimer *hrtimer)
> +{
> +       struct intel_engine_cs *engine =
> +               container_of(hrtimer, struct intel_engine_cs,
> +                            fpreempt_timer);
> +
> +       engine->fpreempt_active = true;
> +       queue_work(engine->i915->fpreempt_wq, &engine->fpreempt_work);

Ah I see you didn't check.

> +
> +       return HRTIMER_NORESTART;
> +}
> +
> +static void intel_engine_fpreempt_work(struct work_struct *work)
> +{
> +       struct intel_engine_cs *engine =
> +               container_of(work, struct intel_engine_cs,
> +                            fpreempt_work);
> +
> +       i915_handle_reset(engine->i915, intel_engine_flag(engine));
> +}
> +
> +static void intel_engine_init_fpreempt(struct intel_engine_cs *engine)
> +{
> +       if (!INTEL_INFO(engine->i915)->has_logical_ring_preemption)
> +               return;
> +
> +       hrtimer_init(&engine->fpreempt_timer,
> +                    CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +       engine->fpreempt_timer.function = intel_engine_fpreempt_timer;
> +       INIT_WORK(&engine->fpreempt_work, intel_engine_fpreempt_work);

Just set the few pointers. It's not like we are saving anything.

> +}
> +
>  /**
>   * intel_engines_setup_common - setup engine state not requiring hw access
>   * @engine: Engine to setup.
> @@ -426,6 +478,7 @@ void intel_engine_setup_common(struct intel_engine_cs *engine)
>  
>         intel_engine_init_timeline(engine);
>         intel_engine_init_hangcheck(engine);
> +       intel_engine_init_fpreempt(engine);
>         i915_gem_batch_pool_init(engine, &engine->batch_pool);
>  
>         intel_engine_init_cmd_parser(engine);
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 581483886153..17487f8e8b4c 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -628,6 +628,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>                         inject_preempt_context(engine);
>                         execlists_set_active(execlists,
>                                              EXECLISTS_ACTIVE_PREEMPT);
> +                       intel_engine_queue_fpreempt(engine);
>                         goto unlock;
>                 } else {
>                         /*
> @@ -846,6 +847,7 @@ static void intel_lrc_irq_handler(unsigned long data)
>                 const u32 *buf =
>                         &engine->status_page.page_addr[I915_HWS_CSB_BUF0_INDEX];
>                 unsigned int head, tail;
> +               bool defer = false;
>  
>                 /* However GVT emulation depends upon intercepting CSB mmio */
>                 if (unlikely(execlists->csb_use_mmio)) {
> @@ -919,6 +921,21 @@ static void intel_lrc_irq_handler(unsigned long data)
>  
>                         if (status & GEN8_CTX_STATUS_ACTIVE_IDLE &&
>                             buf[2*head + 1] == PREEMPT_ID) {
> +                               /* Try to cancel any pending force preemption.
> +                                * If we are too late, hold off on processing
> +                                * the completed preemption until reset has
> +                                * run its course. It should recognize that
> +                                * the engine has preempted to idle then abort
> +                                * the reset. Then we can resume processing
> +                                * at this CSB head.
> +                                */

This can be much simpler than implied here.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 8/8] drm/i915: Force preemption to complete via engine reset
  2018-03-16 18:31 ` [RFC 8/8] drm/i915: Force preemption to complete via engine reset jeff.mcgee
  2018-03-16 20:39   ` Chris Wilson
@ 2018-03-16 20:44   ` Chris Wilson
  1 sibling, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2018-03-16 20:44 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:31:05)
> +bool intel_engine_cancel_fpreempt(struct intel_engine_cs *engine)
> +{
> +       hrtimer_cancel(&engine->fpreempt_timer);

Cannot be used inside the tasklet, as the hrtimer callback may also be
running from softirq context on the same CPU; hrtimer_cancel() would then
busy-spin waiting for a callback that cannot make progress, i.e. deadlock.
-Chris
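
A minimal sketch of the tasklet-safe alternative, assuming the preempt_timer
field added by the force-preempt-reset patch later in this thread (not code
from the posted series). Unlike hrtimer_cancel(), hrtimer_try_to_cancel()
never waits for a running callback, so it cannot busy-spin in softirq context:

static bool cancel_preempt_timeout(struct intel_engine_execlists *execlists)
{
	/*
	 * hrtimer_try_to_cancel() returns:
	 *   1  timer was armed and has now been cancelled
	 *   0  timer was not armed (preemption completed in time)
	 *  -1  callback is currently running; too late to cancel, so the
	 *      reset path must be left to run its course
	 */
	return hrtimer_try_to_cancel(&execlists->preempt_timer) >= 0;
}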
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (8 preceding siblings ...)
  2018-03-16 18:59 ` ✗ Fi.CI.BAT: failure for Force preemption Patchwork
@ 2018-03-16 20:53 ` Chris Wilson
  2018-03-16 21:03   ` Chris Wilson
  2018-03-16 22:34 ` [RFC 0/8] Force preemption Chris Wilson
  10 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2018-03-16 20:53 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:30:57)
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> Force preemption uses engine reset to enforce a limit on the time
> that a request targeted for preemption can block. This feature is
> a requirement in automotive systems where the GPU may be shared by
> clients of critically high priority and clients of low priority that
> may not have been curated to be preemption friendly. There may be
> more general applications of this feature. I'm sharing as an RFC to
> stimulate that discussion and also to get any technical feedback
> that I can before submitting to the product kernel that needs this.
> I have developed the patches for ease of rebase, given that this is
> for the moment considered a non-upstreamable feature. It would be
> possible to refactor hangcheck to fully incorporate force preemption
> as another tier of patience (or impatience) with the running request.

Last night I spent 15 mins and wrote a similar RFC, as Joonas kept
muttering and I thought he was looking for code; so here is some...

The only real difference is the approach to handling the pending timer.
I do not agree with your reset changes, as they appear to percolate even
more uncertainty through the code. If the preempt-context is in flight
when the reset is triggered, the last active request is still defined by
the incomplete request; the new request has not been submitted. This is
serialised by checking for EXECLISTS_ACTIVE_PREEMPT. Up until the point
that it is cleared and the next request submitted, we should reset the
engine and resubmit. The race in tasklet handling is immaterial at that
point; you have already paid the cost in using a timer, the workqueue
and serialising for the reset.

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 76ea1da923bd..d9da68efbc86 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -3969,8 +3969,8 @@ i915_wedged_set(void *data, u64 val)
                engine->hangcheck.stalled = true;
        }
 
-       i915_handle_error(i915, val, "Manually set wedged engine mask = %llx",
-                         val);
+       i915_handle_error(i915, val, I915_ERROR_CAPTURE,
+                         "Manually set wedged engine mask = %llx", val);
 
        wait_on_bit(&i915->gpu_error.flags,
                    I915_RESET_HANDOFF,
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 26c1dd4b542e..d817b5e55c96 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2755,10 +2755,12 @@ static inline void i915_queue_hangcheck(struct drm_i915_private *dev_priv)
                           &dev_priv->gpu_error.hangcheck_work, delay);
 }
 
-__printf(3, 4)
+__printf(4, 5)
 void i915_handle_error(struct drm_i915_private *dev_priv,
                       u32 engine_mask,
+                      unsigned long flags,
                       const char *fmt, ...);
+#define I915_ERROR_CAPTURE BIT(0)
 
 extern void intel_irq_init(struct drm_i915_private *dev_priv);
 extern void intel_irq_fini(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 5aaf356fd27f..0a255ffe4c51 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2593,6 +2593,7 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
  * i915_handle_error - handle a gpu error
  * @dev_priv: i915 device private
  * @engine_mask: mask representing engines that are hung
+ * @flags: control flags
  * @fmt: Error message format string
  *
  * Do some basic checking of register state at error time and
@@ -2603,16 +2604,11 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
  */
 void i915_handle_error(struct drm_i915_private *dev_priv,
                       u32 engine_mask,
+                      unsigned long flags,
                       const char *fmt, ...)
 {
        struct intel_engine_cs *engine;
        unsigned int tmp;
-       va_list args;
-       char error_msg[80];
-
-       va_start(args, fmt);
-       vscnprintf(error_msg, sizeof(error_msg), fmt, args);
-       va_end(args);
 
        /*
         * In most cases it's guaranteed that we get here with an RPM
@@ -2624,8 +2620,18 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
        intel_runtime_pm_get(dev_priv);
 
        engine_mask &= DEVICE_INFO(dev_priv)->ring_mask;
-       i915_capture_error_state(dev_priv, engine_mask, error_msg);
-       i915_clear_error_registers(dev_priv);
+
+       if (flags & I915_ERROR_CAPTURE) {
+               char error_msg[80];
+               va_list args;
+
+               va_start(args, fmt);
+               vscnprintf(error_msg, sizeof(error_msg), fmt, args);
+               va_end(args);
+
+               i915_capture_error_state(dev_priv, engine_mask, error_msg);
+               i915_clear_error_registers(dev_priv);
+       }
 
        /*
         * Try engine reset when available. We fall back to full reset if
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index 42e45ae87393..13d1a269c771 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -246,7 +246,7 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
         */
        tmp = I915_READ_CTL(engine);
        if (tmp & RING_WAIT) {
-               i915_handle_error(dev_priv, 0,
+               i915_handle_error(dev_priv, 0, 0,
                                  "Kicking stuck wait on %s",
                                  engine->name);
                I915_WRITE_CTL(engine, tmp);
@@ -258,7 +258,7 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
                default:
                        return ENGINE_DEAD;
                case 1:
-                       i915_handle_error(dev_priv, 0,
+                       i915_handle_error(dev_priv, 0, 0,
                                          "Kicking stuck semaphore on %s",
                                          engine->name);
                        I915_WRITE_CTL(engine, tmp);
@@ -392,7 +392,7 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
                                 "%s, ", engine->name);
        msg[len-2] = '\0';
 
-       return i915_handle_error(i915, hung, "%s", msg);
+       return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
 }
 
 /*
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 396abef55dd3..55f25241a199 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -529,8 +529,35 @@ static void inject_preempt_context(struct intel_engine_cs *engine)
        if (execlists->ctrl_reg)
                writel(EL_CTRL_LOAD, execlists->ctrl_reg);
 
-       execlists_clear_active(&engine->execlists, EXECLISTS_ACTIVE_HWACK);
-       execlists_set_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT);
+       execlists_clear_active(execlists, EXECLISTS_ACTIVE_HWACK);
+       execlists_set_active(execlists, EXECLISTS_ACTIVE_PREEMPT);
+
+       /* Set a timer to force preemption vs hostile userspace */
+       if (execlists->preempt_timeout_ns)
+               hrtimer_start(&execlists->preempt_timer,
+                             ktime_set(0, execlists->preempt_timeout_ns),
+                             HRTIMER_MODE_REL);
+}
+
+static enum hrtimer_restart preempt_timeout(struct hrtimer *hrtimer)
+{
+       struct intel_engine_cs *engine =
+               container_of(hrtimer, typeof(*engine), execlists.preempt_timer);
+
+       if (execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
+               queue_work(system_highpri_wq, &engine->execlists.preempt_reset);
+
+       return HRTIMER_NORESTART;
+}
+
+static void preempt_reset(struct work_struct *work)
+{
+       struct intel_engine_cs *engine =
+               container_of(work, typeof(*engine), execlists.preempt_reset);
+
+       if (execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
+               i915_handle_error(engine->i915, BIT(engine->id), 0,
+                                 "preemption timed out on %s", engine->name);
 }
 
 static void update_rps(struct intel_engine_cs *engine)
@@ -946,6 +973,7 @@ static void execlists_submission_tasklet(unsigned long data)
                                                                EXECLISTS_ACTIVE_PREEMPT));
                                execlists_clear_active(execlists,
                                                       EXECLISTS_ACTIVE_PREEMPT);
+                               hrtimer_try_to_cancel(&execlists->preempt_timer);
                                continue;
                        }
 
@@ -2180,6 +2208,11 @@ logical_ring_setup(struct intel_engine_cs *engine)
        tasklet_init(&engine->execlists.tasklet,
                     execlists_submission_tasklet, (unsigned long)engine);
 
+       INIT_WORK(&engine->execlists.preempt_reset, preempt_reset);
+       hrtimer_init(&engine->execlists.preempt_timer,
+                    CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+       engine->execlists.preempt_timer.function = preempt_timeout;
+
        logical_ring_default_vfuncs(engine);
        logical_ring_default_irqs(engine);
 }
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 495b21fc33db..74fcff8a2a6e 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -307,6 +307,10 @@ struct intel_engine_execlists {
         * @preempt_complete_status: expected CSB upon completing preemption
         */
        u32 preempt_complete_status;
+
+       struct hrtimer preempt_timer;
+       struct work_struct preempt_reset;
+       unsigned long preempt_timeout_ns;
 };
 
 #define INTEL_ENGINE_CS_MAX_NAME 8
diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index c61a51e46eb2..e99705068190 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -1091,7 +1091,7 @@ static int igt_handle_error(void *arg)
        engine->hangcheck.stalled = true;
        engine->hangcheck.seqno = intel_engine_get_seqno(engine);
 
-       i915_handle_error(i915, intel_engine_flag(engine), "%s", __func__);
+       i915_handle_error(i915, intel_engine_flag(engine), 0, "%s", __func__);
 
        xchg(&i915->gpu_error.first_error, error);


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-16 20:53 ` [RFC 0/8] " Chris Wilson
@ 2018-03-16 21:03   ` Chris Wilson
  2018-03-16 23:04     ` [PATCH] force-preempt-reset Chris Wilson
  2018-03-16 23:45     ` ✗ Fi.CI.BAT: failure for force-preempt-reset Patchwork
  0 siblings, 2 replies; 33+ messages in thread
From: Chris Wilson @ 2018-03-16 21:03 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting Chris Wilson (2018-03-16 20:53:01)
> +static void preempt_reset(struct work_struct *work)
> +{
> +       struct intel_engine_cs *engine =
> +               container_of(work, typeof(*engine), execlists.preempt_reset);
> +

So thinking about the races you had in the reset, you need

tasklet_kill()
tasklet_disable();

> +       if (execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
> +               i915_handle_error(engine->i915, BIT(engine->id), 0,
> +                                 "preemption timed out on %s", engine->name);

tasklet_enable();

here for the serialisation on EXECLISTS_ACTIVE_PREEMPT.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (9 preceding siblings ...)
  2018-03-16 20:53 ` [RFC 0/8] " Chris Wilson
@ 2018-03-16 22:34 ` Chris Wilson
  10 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2018-03-16 22:34 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:30:57)
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> Force preemption uses engine reset to enforce a limit on the time
> that a request targeted for preemption can block. This feature is
> a requirement in automotive systems where the GPU may be shared by
> clients of critically high priority and clients of low priority that
> may not have been curated to be preemption friendly. There may be
> more general applications of this feature. I'm sharing as an RFC to
> stimulate that discussion and also to get any technical feedback
> that I can before submitting to the product kernel that needs this.
> I have developed the patches for ease of rebase, given that this is
> for the moment considered a non-upstreamable feature. It would be
> possible to refactor hangcheck to fully incorporate force preemption
> as another tier of patience (or impatience) with the running request.

So one big problem I should have realised earlier is that ksoftirqd screws
up any latency-sensitive operation: once softirq processing is deferred to
the ksoftirqd thread it competes with ordinary tasks for CPU time, so the
submission tasklet (and with it the preemption completion we are trying to
bound) can be delayed well past the timeout. I have an ugly idea for a hack,
but we really should cap the latency problems with tasklets.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] force-preempt-reset
  2018-03-16 21:03   ` Chris Wilson
@ 2018-03-16 23:04     ` Chris Wilson
  2018-03-16 23:45     ` ✗ Fi.CI.BAT: failure for force-preempt-reset Patchwork
  1 sibling, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2018-03-16 23:04 UTC (permalink / raw)
  To: intel-gfx

A more concrete idea of what I think the serialisation should be between
reset timer and tasklet.

---
 drivers/gpu/drm/i915/intel_lrc.c        | 43 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |  4 +++
 2 files changed, 47 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 2c2eb6d3868a..b6fd4323663d 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -531,6 +531,39 @@ static void inject_preempt_context(struct intel_engine_cs *engine)
 
 	execlists_clear_active(execlists, EXECLISTS_ACTIVE_HWACK);
 	execlists_set_active(execlists, EXECLISTS_ACTIVE_PREEMPT);
+
+	/* Set a timer to force preemption vs hostile userspace */
+	if (execlists->preempt_timeout_ns)
+		hrtimer_start(&execlists->preempt_timer,
+			      ktime_set(0, execlists->preempt_timeout_ns),
+			      HRTIMER_MODE_REL);
+}
+
+static enum hrtimer_restart preempt_timeout(struct hrtimer *hrtimer)
+{
+	struct intel_engine_cs *engine =
+		container_of(hrtimer, typeof(*engine), execlists.preempt_timer);
+
+	if (execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
+		queue_work(system_highpri_wq, &engine->execlists.preempt_reset);
+
+	return HRTIMER_NORESTART;
+}
+
+static void preempt_reset(struct work_struct *work)
+{
+	struct intel_engine_cs *engine =
+		container_of(work, typeof(*engine), execlists.preempt_reset);
+
+	tasklet_disable(&engine->execlists.tasklet);
+
+	engine->execlists.tasklet.func(engine->execlists.tasklet.data);
+
+	if (execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
+		i915_handle_error(engine->i915, BIT(engine->id), 0,
+				  "preemption timed out on %s", engine->name);
+
+	tasklet_enable(&engine->execlists.tasklet);
 }
 
 static void complete_preempt_context(struct intel_engine_execlists *execlists)
@@ -539,6 +572,11 @@ static void complete_preempt_context(struct intel_engine_execlists *execlists)
 	execlists_unwind_incomplete_requests(execlists);
 
 	GEM_BUG_ON(!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT));
+
+	/* If the timer already fired, complete the reset */
+	if (hrtimer_try_to_cancel(&execlists->preempt_timer) < 0)
+		return;
+
 	execlists_clear_active(execlists, EXECLISTS_ACTIVE_PREEMPT);
 }
 
@@ -2182,6 +2220,11 @@ logical_ring_setup(struct intel_engine_cs *engine)
 	tasklet_init(&engine->execlists.tasklet,
 		     execlists_submission_tasklet, (unsigned long)engine);
 
+	INIT_WORK(&engine->execlists.preempt_reset, preempt_reset);
+	hrtimer_init(&engine->execlists.preempt_timer,
+		     CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	engine->execlists.preempt_timer.function = preempt_timeout;
+
 	logical_ring_default_vfuncs(engine);
 	logical_ring_default_irqs(engine);
 }
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 495b21fc33db..74fcff8a2a6e 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -307,6 +307,10 @@ struct intel_engine_execlists {
 	 * @preempt_complete_status: expected CSB upon completing preemption
 	 */
 	u32 preempt_complete_status;
+
+	struct hrtimer preempt_timer;
+	struct work_struct preempt_reset;
+	unsigned long preempt_timeout_ns;
 };
 
 #define INTEL_ENGINE_CS_MAX_NAME 8
-- 
2.16.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* ✗ Fi.CI.BAT: failure for force-preempt-reset
  2018-03-16 21:03   ` Chris Wilson
  2018-03-16 23:04     ` [PATCH] force-preempt-reset Chris Wilson
@ 2018-03-16 23:45     ` Patchwork
  1 sibling, 0 replies; 33+ messages in thread
From: Patchwork @ 2018-03-16 23:45 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: force-preempt-reset
URL   : https://patchwork.freedesktop.org/series/40136/
State : failure

== Summary ==

Applying: force-preempt-reset
error: sha1 information is lacking or useless (drivers/gpu/drm/i915/intel_lrc.c).
error: could not build fake ancestor
Patch failed at 0001 force-preempt-reset
The copy of the patch that failed is found in: .git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-23 13:20                 ` Joonas Lahtinen
@ 2018-03-23 13:37                   ` Chris Wilson
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2018-03-23 13:37 UTC (permalink / raw)
  To: Joonas Lahtinen, Bloomfield, Jon, Mcgee, Jeff, Tvrtko Ursulin
  Cc: Kondapally, Kalyan, intel-gfx, ben

Quoting Joonas Lahtinen (2018-03-23 13:20:40)
> So far nobody has been successful in selling this to the userspace
> compositors (the most likely user), or has somebody?

I hadn't even contemplated selling it. However, it does seem applicable
to the RealTime priorities for Vk and 
https://www.khronos.org/registry/EGL/extensions/NV/EGL_NV_context_priority_realtime.txt

Gah. More fiddling with EGL...

Personally I don't think I'd want X/gnome-shell/weston to adopt a
RealTime or bust philosophy (i.e. I'm not planning to send a patch to
raise weston from PRIORITY_HIGH to PRIORITY_RT), but that doesn't mean
there aren't scenarios where it will be important.

So a context param, say PREEMPT_TIMEOUT in ns, with 0 meaning disable,
bounded by CAP_SYS_NICE. Too soft? CAP_SYS_ADMIN, since it's about
giving permission to nuke others? On the implementation side, having it
on the context is a fiddle, but we should be able to set a queue_timeout
at the same time as setting queue_priority to save having to find the
preempting context at irq time.
-Chris
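
As a purely illustrative sketch of that knob (I915_CONTEXT_PARAM_PREEMPT_TIMEOUT
and ctx->preempt_timeout_ns are hypothetical names, not existing uAPI; the
capability check follows the CAP_SYS_NICE suggestion above):

	case I915_CONTEXT_PARAM_PREEMPT_TIMEOUT:
		if (args->size)
			ret = -EINVAL;
		else if (args->value && !capable(CAP_SYS_NICE))
			ret = -EPERM;	/* permission to nuke others is privileged */
		else
			ctx->preempt_timeout_ns = args->value; /* 0 disables forced reset */
		break;

On submission the scheduler could then propagate the value alongside
queue_priority (the queue_timeout idea above), so the irq handler never has
to look up the preempting context.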
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-22 19:59               ` Bloomfield, Jon
@ 2018-03-23 13:20                 ` Joonas Lahtinen
  2018-03-23 13:37                   ` Chris Wilson
  0 siblings, 1 reply; 33+ messages in thread
From: Joonas Lahtinen @ 2018-03-23 13:20 UTC (permalink / raw)
  To: Bloomfield, Jon, Mcgee, Jeff, Tvrtko Ursulin
  Cc: Kondapally, Kalyan, intel-gfx, ben

Quoting Bloomfield, Jon (2018-03-22 21:59:33)
> > From: Intel-gfx [mailto:intel-gfx-bounces@lists.freedesktop.org] On Behalf
> > Of Jeff McGee
> > Sent: Thursday, March 22, 2018 12:09 PM
> > To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > Cc: Kondapally, Kalyan <kalyan.kondapally@intel.com>; intel-
> > gfx@lists.freedesktop.org; ben@bwidawsk.net
> > Subject: Re: [Intel-gfx] [RFC 0/8] Force preemption
> > 
> > On Thu, Mar 22, 2018 at 05:41:57PM +0000, Tvrtko Ursulin wrote:
> > >
> > > On 22/03/2018 16:01, Jeff McGee wrote:
> > > >On Thu, Mar 22, 2018 at 03:57:49PM +0000, Tvrtko Ursulin wrote:
> > > >>
> > > >>On 22/03/2018 14:34, Jeff McGee wrote:
> > > >>>On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> > > >>>>Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> > > >>>>>
> > > >>>>>On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> > > >>>>>>From: Jeff McGee <jeff.mcgee@intel.com>
> > > >>>>>>
> > > >>>>>>Force preemption uses engine reset to enforce a limit on the time
> > > >>>>>>that a request targeted for preemption can block. This feature is
> > > >>>>>>a requirement in automotive systems where the GPU may be
> > shared by
> > > >>>>>>clients of critically high priority and clients of low priority that
> > > >>>>>>may not have been curated to be preemption friendly. There may
> > be
> > > >>>>>>more general applications of this feature. I'm sharing as an RFC to
> > > >>>>>>stimulate that discussion and also to get any technical feedback
> > > >>>>>>that I can before submitting to the product kernel that needs this.
> > > >>>>>>I have developed the patches for ease of rebase, given that this is
> > > >>>>>>for the moment considered a non-upstreamable feature. It would
> > be
> > > >>>>>>possible to refactor hangcheck to fully incorporate force
> > preemption
> > > >>>>>>as another tier of patience (or impatience) with the running
> > request.
> > > >>>>>
> > > >>>>>Sorry if it was mentioned elsewhere and I missed it - but does this
> > work
> > > >>>>>only with stateless clients - or in other words, what would happen to
> > > >>>>>stateful clients which would be force preempted? Or the answer is
> > we
> > > >>>>>don't care since they are misbehaving?
> > > >>>>
> > > >>>>They get notified of being guilty for causing a gpu reset; three strikes
> > > >>>>and they are out (banned from using the gpu) using the current rules.
> > > >>>>This is a very blunt hammer that requires the rest of the system to be
> > > >>>>robust; one might argue time spent making the system robust would
> > be
> > > >>>>better served making sure that the timer never expired in the first
> > place
> > > >>>>thereby eliminating the need for a forced gpu reset.
> > > >>>>-Chris
> > > >>>
> > > >>>Yes, for simplification the policy applied to force preempted contexts
> > > >>>is the same as for hanging contexts. It is known that this feature
> > > >>>should not be required in a fully curated system. It's a requirement
> > > >>>if end user will be allowed to install 3rd party apps to run in the
> > > >>>non-critical domain.
> > > >>
> > > >>My concern is whether it safe to call this force _preemption_, while
> > > >>it is not really expected to work as preemption from the point of
> > > >>view of preempted context. I may be missing some angle here, but I
> > > >>think a better name would include words like maximum request
> > > >>duration or something.
> > > >>
> > > >>I can see a difference between allowed maximum duration when there
> > > >>is something else pending, and when it isn't, but I don't
> > > >>immediately see that we should consider this distinction for any
> > > >>real benefit?
> > > >>
> > > >>So should the feature just be "maximum request duration"? This would
> > > >>perhaps make it just a special case of hangcheck, which ignores head
> > > >>progress, or whatever we do in there.
> > > >>
> > > >>Regards,
> > > >>
> > > >>Tvrtko
> > > >
> > > >I think you might be unclear about how this works. We're not starting a
> > > >preemption to see if we can cleanly remove a request who has begun to
> > > >exceed its normal time slice, i.e. hangcheck. This is about bounding
> > > >the time that a normal preemption can take. So first start preemption
> > > >in response to higher-priority request arrival, then wait for preemption
> > > >to complete within a certain amount of time. If it does not, resort to
> > > >reset.
> > > >
> > > >So it's really "force the resolution of a preemption", shortened to
> > > >"force preemption".
> > >
> > > You are right, I veered off in my thinking and ended up with
> > > something different. :)
> > >
> > > I however still think the name is potentially misleading, since the
> > > request/context is not getting preempted. It is getting effectively
> > > killed (sooner or later, directly or indirectly).
> > >
> > > Maybe that is OK for the specific use case when everything is only
> > > broken and not malicious.
> > >
> > > In a more general purpose system it would be a bit random when
> > > something would work, and when it wouldn't, depending on system
> > > setup and even timings.
> > >
> > > Hm, maybe you don't even really benefit from the standard three
> > > strikes and you are out policy, and for this specific use case, you
> > > should just kill it straight away. If it couldn't be preempted once,
> > > why pay the penalty any more?
> > >
> > > If you don't have it already, devising a solution which blacklists
> > > the process (if it creates more contexts), or even a parent (if
> > > forking is applicable and implementation feasible), for offenders
> > > could also be beneficial.
> > >
> > > Regards,
> > >
> > > Tvrtko
> > 
> > Fair enough. There wasn't a lot of deliberation on this name. We
> > referred to it in various ways during development. I think I started
> > using "force preemption" because it was short. "reset to preempt" was
> > another phrase that was used.
> > 
> > The handling of the guilty client/context could be tailored more. Like
> > I said it was easiest to start with the same sort of handling that we
> > have already for hang scenarios. Simple is good when you are rebasing
> > without much hope to upstream. :(
> > 
> > If there was interest in upstreaming this capability, we could certainly
> > incorporate nicely within a refactoring of hangcheck. And then we
> > wouldn't even need a special name for it. The whole thing could be recast
> > as time slice management, where your slice is condition-based. You get
> > unlimited time if no one wants the engine, X time if someone of equal
> > priority wants the engine, and Y time if someone of higher priority
> > wants the engine, etc. Where 'Y' is analogous to the fpreempt_timeout
> > value in my RFC.
> > 
> On the subject of it being "a bit random when something would work":
> Currently it's the well behaved app (and everything else behind it)
> that pays the price for the badly-behaved umd being a bad citizen. 12s is
> a really long time for the GUI to freeze before a bad context can be nuked.

But the price is not that high when you get a momentary freeze of the
system, compared to applications dying on the user and unsaved work
getting lost, for example.

> Yes, with enforced pre-emption latency, the bad umd pays a price in that,
> if it's lucky, it will run to completion. If unlucky it gets nuked. But it's a bad
> umd at the end of the day. And we should find it and fix it. 

I think the key issue is the difference in difficulty between finding an
offending GPU hogger in a completely controlled product environment vs. in
the wild in a desktop environment.

In the wild an application could have worked for years without problems
(being a GPU hogger, whatever the limit is set at), and then stops working
when another application is introduced that demands the GPU at a higher
priority; the GPU hogging application won't slow down, but rather will
just die.

Requiring all userspace applications to suddenly have a short ARB check
period, or else be in danger of getting killed, is not a light change to
make in the generic desktop environment.

So some clear opt-in, in the form of a "Sacrifice everything to keep running
at 60 FPS [ ]" tick-box from the compositor, would be required.

So far nobody has been successful in selling this to the userspace
compositors (the most likely user), or has somebody?

Regards, Joonas

> With fair-scheduling, this becomes even more important - all umd's need
> to ensure that they are capable of pre-empting to the finest granularity
> supported by the h/w for their respective workloads. If they don't
> they really should be obliterated (aka detected and fixed).
> 
> The big benefit I see with the new approach (vs TDR) is that there's actually
> no need to enforce the (slightly arbitrarily determined 3x4s timeout) for a
> context at all, providing it plays nicely. If you want to run a really long
> GPGPU workload, do so by all means, just don't hog the GPU.
> 
> BTW, the same context banning should be applied to a context nuked
> by forced-preemption as with TDR. That's the intention at least. This is just
> an RFC right now.
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-22 19:08             ` Jeff McGee
@ 2018-03-22 19:59               ` Bloomfield, Jon
  2018-03-23 13:20                 ` Joonas Lahtinen
  0 siblings, 1 reply; 33+ messages in thread
From: Bloomfield, Jon @ 2018-03-22 19:59 UTC (permalink / raw)
  To: Mcgee, Jeff, Tvrtko Ursulin; +Cc: Kondapally, Kalyan, intel-gfx, ben

> From: Intel-gfx [mailto:intel-gfx-bounces@lists.freedesktop.org] On Behalf
> Of Jeff McGee
> Sent: Thursday, March 22, 2018 12:09 PM
> To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Kondapally, Kalyan <kalyan.kondapally@intel.com>; intel-
> gfx@lists.freedesktop.org; ben@bwidawsk.net
> Subject: Re: [Intel-gfx] [RFC 0/8] Force preemption
> 
> On Thu, Mar 22, 2018 at 05:41:57PM +0000, Tvrtko Ursulin wrote:
> >
> > On 22/03/2018 16:01, Jeff McGee wrote:
> > >On Thu, Mar 22, 2018 at 03:57:49PM +0000, Tvrtko Ursulin wrote:
> > >>
> > >>On 22/03/2018 14:34, Jeff McGee wrote:
> > >>>On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> > >>>>Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> > >>>>>
> > >>>>>On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> > >>>>>>From: Jeff McGee <jeff.mcgee@intel.com>
> > >>>>>>
> > >>>>>>Force preemption uses engine reset to enforce a limit on the time
> > >>>>>>that a request targeted for preemption can block. This feature is
> > >>>>>>a requirement in automotive systems where the GPU may be
> shared by
> > >>>>>>clients of critically high priority and clients of low priority that
> > >>>>>>may not have been curated to be preemption friendly. There may
> be
> > >>>>>>more general applications of this feature. I'm sharing as an RFC to
> > >>>>>>stimulate that discussion and also to get any technical feedback
> > >>>>>>that I can before submitting to the product kernel that needs this.
> > >>>>>>I have developed the patches for ease of rebase, given that this is
> > >>>>>>for the moment considered a non-upstreamable feature. It would
> be
> > >>>>>>possible to refactor hangcheck to fully incorporate force
> preemption
> > >>>>>>as another tier of patience (or impatience) with the running
> request.
> > >>>>>
> > >>>>>Sorry if it was mentioned elsewhere and I missed it - but does this
> work
> > >>>>>only with stateless clients - or in other words, what would happen to
> > >>>>>stateful clients which would be force preempted? Or the answer is
> we
> > >>>>>don't care since they are misbehaving?
> > >>>>
> > >>>>They get notified of being guilty for causing a gpu reset; three strikes
> > >>>>and they are out (banned from using the gpu) using the current rules.
> > >>>>This is a very blunt hammer that requires the rest of the system to be
> > >>>>robust; one might argue time spent making the system robust would
> be
> > >>>>better served making sure that the timer never expired in the first
> place
> > >>>>thereby eliminating the need for a forced gpu reset.
> > >>>>-Chris
> > >>>
> > >>>Yes, for simplification the policy applied to force preempted contexts
> > >>>is the same as for hanging contexts. It is known that this feature
> > >>>should not be required in a fully curated system. It's a requirement
> > >>>if end user will be allowed to install 3rd party apps to run in the
> > >>>non-critical domain.
> > >>
> > >>My concern is whether it safe to call this force _preemption_, while
> > >>it is not really expected to work as preemption from the point of
> > >>view of preempted context. I may be missing some angle here, but I
> > >>think a better name would include words like maximum request
> > >>duration or something.
> > >>
> > >>I can see a difference between allowed maximum duration when there
> > >>is something else pending, and when it isn't, but I don't
> > >>immediately see that we should consider this distinction for any
> > >>real benefit?
> > >>
> > >>So should the feature just be "maximum request duration"? This would
> > >>perhaps make it just a special case of hangcheck, which ignores head
> > >>progress, or whatever we do in there.
> > >>
> > >>Regards,
> > >>
> > >>Tvrtko
> > >
> > >I think you might be unclear about how this works. We're not starting a
> > >preemption to see if we can cleanly remove a request who has begun to
> > >exceed its normal time slice, i.e. hangcheck. This is about bounding
> > >the time that a normal preemption can take. So first start preemption
> > >in response to higher-priority request arrival, then wait for preemption
> > >to complete within a certain amount of time. If it does not, resort to
> > >reset.
> > >
> > >So it's really "force the resolution of a preemption", shortened to
> > >"force preemption".
> >
> > You are right, I veered off in my thinking and ended up with
> > something different. :)
> >
> > I however still think the name is potentially misleading, since the
> > request/context is not getting preempted. It is getting effectively
> > killed (sooner or later, directly or indirectly).
> >
> > Maybe that is OK for the specific use case when everything is only
> > broken and not malicious.
> >
> > In a more general purpose system it would be a bit random when
> > something would work, and when it wouldn't, depending on system
> > setup and even timings.
> >
> > Hm, maybe you don't even really benefit from the standard three
> > strikes and you are out policy, and for this specific use case, you
> > should just kill it straight away. If it couldn't be preempted once,
> > why pay the penalty any more?
> >
> > If you don't have it already, devising a solution which blacklists
> > the process (if it creates more contexts), or even a parent (if
> > forking is applicable and implementation feasible), for offenders
> > could also be beneficial.
> >
> > Regards,
> >
> > Tvrtko
> 
> Fair enough. There wasn't a lot of deliberation on this name. We
> referred to it in various ways during development. I think I started
> using "force preemption" because it was short. "reset to preempt" was
> another phrase that was used.
> 
> The handling of the guilty client/context could be tailored more. Like
> I said it was easiest to start with the same sort of handling that we
> have already for hang scenarios. Simple is good when you are rebasing
> without much hope to upstream. :(
> 
> If there was interest in upstreaming this capability, we could certainly
> incorporate nicely within a refactoring of hangcheck. And then we
> wouldn't even need a special name for it. The whole thing could be recast
> as time slice management, where your slice is condition-based. You get
> unlimited time if no one wants the engine, X time if someone of equal
> priority wants the engine, and Y time if someone of higher priority
> wants the engine, etc. Where 'Y' is analogous to the fpreempt_timeout
> value in my RFC.
> 
On the subject of it being "a bit random when something would work":
currently it's the well-behaved app (and everything else behind it)
that pays the price for the badly-behaved umd being a bad citizen. 12s is
a really long time for the GUI to freeze before a bad context can be nuked.

Yes, with enforced pre-emption latency, the bad umd pays a price in that,
if it's lucky, it will run to completion. If unlucky, it gets nuked. But it's
a bad umd at the end of the day. And we should find it and fix it.

With fair-scheduling, this becomes even more important: all umds need
to ensure that they are capable of pre-empting to the finest granularity
supported by the h/w for their respective workloads. If they don't,
they really should be obliterated (aka detected and fixed).

The big benefit I see with the new approach (vs TDR) is that there's actually
no need to enforce the (slightly arbitrarily determined) 3x4s timeout for a
context at all, providing it plays nicely. If you want to run a really long
GPGPU workload, do so by all means, just don't hog the GPU.

BTW, the same context banning should be applied to a context nuked
by forced-preemption as with TDR. That's the intention at least. This is just
an RFC right now.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-22 17:41           ` Tvrtko Ursulin
@ 2018-03-22 19:08             ` Jeff McGee
  2018-03-22 19:59               ` Bloomfield, Jon
  0 siblings, 1 reply; 33+ messages in thread
From: Jeff McGee @ 2018-03-22 19:08 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: kalyan.kondapally, intel-gfx, ben

On Thu, Mar 22, 2018 at 05:41:57PM +0000, Tvrtko Ursulin wrote:
> 
> On 22/03/2018 16:01, Jeff McGee wrote:
> >On Thu, Mar 22, 2018 at 03:57:49PM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 22/03/2018 14:34, Jeff McGee wrote:
> >>>On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> >>>>Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> >>>>>
> >>>>>On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> >>>>>>From: Jeff McGee <jeff.mcgee@intel.com>
> >>>>>>
> >>>>>>Force preemption uses engine reset to enforce a limit on the time
> >>>>>>that a request targeted for preemption can block. This feature is
> >>>>>>a requirement in automotive systems where the GPU may be shared by
> >>>>>>clients of critically high priority and clients of low priority that
> >>>>>>may not have been curated to be preemption friendly. There may be
> >>>>>>more general applications of this feature. I'm sharing as an RFC to
> >>>>>>stimulate that discussion and also to get any technical feedback
> >>>>>>that I can before submitting to the product kernel that needs this.
> >>>>>>I have developed the patches for ease of rebase, given that this is
> >>>>>>for the moment considered a non-upstreamable feature. It would be
> >>>>>>possible to refactor hangcheck to fully incorporate force preemption
> >>>>>>as another tier of patience (or impatience) with the running request.
> >>>>>
> >>>>>Sorry if it was mentioned elsewhere and I missed it - but does this work
> >>>>>only with stateless clients - or in other words, what would happen to
> >>>>>stateful clients which would be force preempted? Or the answer is we
> >>>>>don't care since they are misbehaving?
> >>>>
> >>>>They get notified of being guilty for causing a gpu reset; three strikes
> >>>>and they are out (banned from using the gpu) using the current rules.
> >>>>This is a very blunt hammer that requires the rest of the system to be
> >>>>robust; one might argue time spent making the system robust would be
> >>>>better served making sure that the timer never expired in the first place
> >>>>thereby eliminating the need for a forced gpu reset.
> >>>>-Chris
> >>>
> >>>Yes, for simplification the policy applied to force preempted contexts
> >>>is the same as for hanging contexts. It is known that this feature
> >>>should not be required in a fully curated system. It's a requirement
> >>>if end user will be allowed to install 3rd party apps to run in the
> >>>non-critical domain.
> >>
> >>My concern is whether it safe to call this force _preemption_, while
> >>it is not really expected to work as preemption from the point of
> >>view of preempted context. I may be missing some angle here, but I
> >>think a better name would include words like maximum request
> >>duration or something.
> >>
> >>I can see a difference between allowed maximum duration when there
> >>is something else pending, and when it isn't, but I don't
> >>immediately see that we should consider this distinction for any
> >>real benefit?
> >>
> >>So should the feature just be "maximum request duration"? This would
> >>perhaps make it just a special case of hangcheck, which ignores head
> >>progress, or whatever we do in there.
> >>
> >>Regards,
> >>
> >>Tvrtko
> >
> >I think you might be unclear about how this works. We're not starting a
> >preemption to see if we can cleanly remove a request who has begun to
> >exceed its normal time slice, i.e. hangcheck. This is about bounding
> >the time that a normal preemption can take. So first start preemption
> >in response to higher-priority request arrival, then wait for preemption
> >to complete within a certain amount of time. If it does not, resort to
> >reset.
> >
> >So it's really "force the resolution of a preemption", shortened to
> >"force preemption".
> 
> You are right, I veered off in my thinking and ended up with
> something different. :)
> 
> I however still think the name is potentially misleading, since the
> request/context is not getting preempted. It is getting effectively
> killed (sooner or later, directly or indirectly).
> 
> Maybe that is OK for the specific use case when everything is only
> broken and not malicious.
> 
> In a more general purpose system it would be a bit random when
> something would work, and when it wouldn't, depending on system
> setup and even timings.
> 
> Hm, maybe you don't even really benefit from the standard three
> strikes and you are out policy, and for this specific use case, you
> should just kill it straight away. If it couldn't be preempted once,
> why pay the penalty any more?
> 
> If you don't have it already, devising a solution which blacklists
> the process (if it creates more contexts), or even a parent (if
> forking is applicable and implementation feasible), for offenders
> could also be beneficial.
> 
> Regards,
> 
> Tvrtko

Fair enough. There wasn't a lot of deliberation on this name. We
referred to it in various ways during development. I think I started
using "force preemption" because it was short. "reset to preempt" was
another phrase that was used.

The handling of the guilty client/context could be tailored more. Like
I said, it was easiest to start with the same sort of handling that we
have already for hang scenarios. Simple is good when you are rebasing
without much hope to upstream. :(

If there were interest in upstreaming this capability, we could certainly
incorporate it nicely within a refactoring of hangcheck. And then we
wouldn't even need a special name for it. The whole thing could be recast
as time slice management, where your slice is condition-based. You get
unlimited time if no one wants the engine, X time if someone of equal
priority wants the engine, and Y time if someone of higher priority
wants the engine, etc. Where 'Y' is analogous to the fpreempt_timeout
value in my RFC.

Jeff
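
A minimal sketch of that condition-based slice (the function, the tier values
and the 0-means-unlimited convention are illustrative assumptions, not part of
the posted series):

static u64 slice_limit_ns(int running_prio, int waiting_prio, bool contended)
{
	/* Nobody else wants the engine: no limit at all. */
	if (!contended)
		return 0;

	/* "Y": a higher-priority request is waiting, be impatient. */
	if (waiting_prio > running_prio)
		return 10 * NSEC_PER_MSEC;

	/* "X": an equal-priority request is waiting, be more patient. */
	return 100 * NSEC_PER_MSEC;
}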
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-22 16:01         ` Jeff McGee
@ 2018-03-22 17:41           ` Tvrtko Ursulin
  2018-03-22 19:08             ` Jeff McGee
  0 siblings, 1 reply; 33+ messages in thread
From: Tvrtko Ursulin @ 2018-03-22 17:41 UTC (permalink / raw)
  To: Jeff McGee; +Cc: kalyan.kondapally, intel-gfx, ben


On 22/03/2018 16:01, Jeff McGee wrote:
> On Thu, Mar 22, 2018 at 03:57:49PM +0000, Tvrtko Ursulin wrote:
>>
>> On 22/03/2018 14:34, Jeff McGee wrote:
>>> On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
>>>> Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
>>>>>
>>>>> On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
>>>>>> From: Jeff McGee <jeff.mcgee@intel.com>
>>>>>>
>>>>>> Force preemption uses engine reset to enforce a limit on the time
>>>>>> that a request targeted for preemption can block. This feature is
>>>>>> a requirement in automotive systems where the GPU may be shared by
>>>>>> clients of critically high priority and clients of low priority that
>>>>>> may not have been curated to be preemption friendly. There may be
>>>>>> more general applications of this feature. I'm sharing as an RFC to
>>>>>> stimulate that discussion and also to get any technical feedback
>>>>>> that I can before submitting to the product kernel that needs this.
>>>>>> I have developed the patches for ease of rebase, given that this is
>>>>>> for the moment considered a non-upstreamable feature. It would be
>>>>>> possible to refactor hangcheck to fully incorporate force preemption
>>>>>> as another tier of patience (or impatience) with the running request.
>>>>>
>>>>> Sorry if it was mentioned elsewhere and I missed it - but does this work
>>>>> only with stateless clients - or in other words, what would happen to
>>>>> stateful clients which would be force preempted? Or the answer is we
>>>>> don't care since they are misbehaving?
>>>>
>>>> They get notified of being guilty for causing a gpu reset; three strikes
>>>> and they are out (banned from using the gpu) using the current rules.
>>>> This is a very blunt hammer that requires the rest of the system to be
>>>> robust; one might argue time spent making the system robust would be
>>>> better served making sure that the timer never expired in the first place
>>>> thereby eliminating the need for a forced gpu reset.
>>>> -Chris
>>>
>>> Yes, for simplification the policy applied to force preempted contexts
>>> is the same as for hanging contexts. It is known that this feature
>>> should not be required in a fully curated system. It's a requirement
>>> if end user will be allowed to install 3rd party apps to run in the
>>> non-critical domain.
>>
>> My concern is whether it safe to call this force _preemption_, while
>> it is not really expected to work as preemption from the point of
>> view of preempted context. I may be missing some angle here, but I
>> think a better name would include words like maximum request
>> duration or something.
>>
>> I can see a difference between allowed maximum duration when there
>> is something else pending, and when it isn't, but I don't
>> immediately see that we should consider this distinction for any
>> real benefit?
>>
>> So should the feature just be "maximum request duration"? This would
>> perhaps make it just a special case of hangcheck, which ignores head
>> progress, or whatever we do in there.
>>
>> Regards,
>>
>> Tvrtko
> 
> I think you might be unclear about how this works. We're not starting a
> preemption to see if we can cleanly remove a request who has begun to
> exceed its normal time slice, i.e. hangcheck. This is about bounding
> the time that a normal preemption can take. So first start preemption
> in response to higher-priority request arrival, then wait for preemption
> to complete within a certain amount of time. If it does not, resort to
> reset.
> 
> So it's really "force the resolution of a preemption", shortened to
> "force preemption".

You are right, I veered off in my thinking and ended up with something 
different. :)

I however still think the name is potentially misleading, since the 
request/context is not getting preempted. It is getting effectively 
killed (sooner or later, directly or indirectly).

Maybe that is OK for the specific use case when everything is only 
broken and not malicious.

In a more general purpose system it would be a bit random when something 
would work, and when it wouldn't, depending on system setup and even 
timings.

Hm, maybe you don't even really benefit from the standard three strikes 
and you are out policy, and for this specific use case, you should just 
kill it straight away. If it couldn't be preempted once, why pay the 
penalty any more?

If you don't have it already, devising a solution that blacklists the 
offending process (if it creates more contexts), or even its parent (if 
forking is applicable and implementation is feasible), could also be 
beneficial.
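
For example, something along these lines (an entirely hypothetical
sketch; the structures and names are invented for illustration and are
not taken from the driver):

#include <stdbool.h>

/* Entirely hypothetical per-process blacklist, invented for illustration. */

#define MAX_OFFENDERS 64

struct offender {
        int pid;          /* process whose context got banned          */
        int parent_pid;   /* optionally extend the ban to the parent   */
};

static struct offender blacklist[MAX_OFFENDERS];
static int nr_offenders;

/* Record the owner of a banned context (and optionally its parent). */
static void blacklist_process(int pid, int parent_pid)
{
        if (nr_offenders < MAX_OFFENDERS) {
                blacklist[nr_offenders].pid = pid;
                blacklist[nr_offenders].parent_pid = parent_pid;
                nr_offenders++;
        }
}

/* Consulted when a process asks to create a new GPU context. */
static bool is_blacklisted(int pid)
{
        int i;

        for (i = 0; i < nr_offenders; i++)
                if (blacklist[i].pid == pid || blacklist[i].parent_pid == pid)
                        return true;

        return false;
}

int main(void)
{
        blacklist_process(1234, 1000);

        return (is_blacklisted(1234) && is_blacklisted(1000)) ? 0 : 1;
}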

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-22 15:57       ` Tvrtko Ursulin
@ 2018-03-22 16:01         ` Jeff McGee
  2018-03-22 17:41           ` Tvrtko Ursulin
  0 siblings, 1 reply; 33+ messages in thread
From: Jeff McGee @ 2018-03-22 16:01 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: kalyan.kondapally, intel-gfx, ben

On Thu, Mar 22, 2018 at 03:57:49PM +0000, Tvrtko Ursulin wrote:
> 
> On 22/03/2018 14:34, Jeff McGee wrote:
> >On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> >>Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> >>>
> >>>On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> >>>>From: Jeff McGee <jeff.mcgee@intel.com>
> >>>>
> >>>>Force preemption uses engine reset to enforce a limit on the time
> >>>>that a request targeted for preemption can block. This feature is
> >>>>a requirement in automotive systems where the GPU may be shared by
> >>>>clients of critically high priority and clients of low priority that
> >>>>may not have been curated to be preemption friendly. There may be
> >>>>more general applications of this feature. I'm sharing as an RFC to
> >>>>stimulate that discussion and also to get any technical feedback
> >>>>that I can before submitting to the product kernel that needs this.
> >>>>I have developed the patches for ease of rebase, given that this is
> >>>>for the moment considered a non-upstreamable feature. It would be
> >>>>possible to refactor hangcheck to fully incorporate force preemption
> >>>>as another tier of patience (or impatience) with the running request.
> >>>
> >>>Sorry if it was mentioned elsewhere and I missed it - but does this work
> >>>only with stateless clients - or in other words, what would happen to
> >>>stateful clients which would be force preempted? Or the answer is we
> >>>don't care since they are misbehaving?
> >>
> >>They get notified of being guilty for causing a gpu reset; three strikes
> >>and they are out (banned from using the gpu) using the current rules.
> >>This is a very blunt hammer that requires the rest of the system to be
> >>robust; one might argue time spent making the system robust would be
> >>better served making sure that the timer never expired in the first place
> >>thereby eliminating the need for a forced gpu reset.
> >>-Chris
> >
> >Yes, for simplification the policy applied to force preempted contexts
> >is the same as for hanging contexts. It is known that this feature
> >should not be required in a fully curated system. It's a requirement
> >if the end user will be allowed to install 3rd party apps to run in the
> >non-critical domain.
> 
> My concern is whether it is safe to call this force _preemption_, while
> it is not really expected to work as preemption from the point of
> view of the preempted context. I may be missing some angle here, but I
> think a better name would include words like maximum request
> duration or something.
> 
> I can see a difference between allowed maximum duration when there
> is something else pending, and when it isn't, but I don't
> immediately see that we should consider this distinction for any
> real benefit?
> 
> So should the feature just be "maximum request duration"? This would
> perhaps make it just a special case of hangcheck, which ignores head
> progress, or whatever we do in there.
> 
> Regards,
> 
> Tvrtko

I think you might be unclear about how this works. We're not starting a
preemption to see if we can cleanly remove a request that has begun to
exceed its normal time slice, i.e. hangcheck. This is about bounding
the time that a normal preemption can take. So first start preemption
in response to higher-priority request arrival, then wait for preemption
to complete within a certain amount of time. If it does not, resort to
reset.

So it's really "force the resolution of a preemption", shortened to
"force preemption".
-Jeff
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-22 14:34     ` Jeff McGee
  2018-03-22 15:35       ` Chris Wilson
@ 2018-03-22 15:57       ` Tvrtko Ursulin
  2018-03-22 16:01         ` Jeff McGee
  1 sibling, 1 reply; 33+ messages in thread
From: Tvrtko Ursulin @ 2018-03-22 15:57 UTC (permalink / raw)
  To: Jeff McGee, Chris Wilson; +Cc: kalyan.kondapally, intel-gfx, ben


On 22/03/2018 14:34, Jeff McGee wrote:
> On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
>> Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
>>>
>>> On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
>>>> From: Jeff McGee <jeff.mcgee@intel.com>
>>>>
>>>> Force preemption uses engine reset to enforce a limit on the time
>>>> that a request targeted for preemption can block. This feature is
>>>> a requirement in automotive systems where the GPU may be shared by
>>>> clients of critically high priority and clients of low priority that
>>>> may not have been curated to be preemption friendly. There may be
>>>> more general applications of this feature. I'm sharing as an RFC to
>>>> stimulate that discussion and also to get any technical feedback
>>>> that I can before submitting to the product kernel that needs this.
>>>> I have developed the patches for ease of rebase, given that this is
>>>> for the moment considered a non-upstreamable feature. It would be
>>>> possible to refactor hangcheck to fully incorporate force preemption
>>>> as another tier of patience (or impatience) with the running request.
>>>
>>> Sorry if it was mentioned elsewhere and I missed it - but does this work
>>> only with stateless clients - or in other words, what would happen to
>>> stateful clients which would be force preempted? Or the answer is we
>>> don't care since they are misbehaving?
>>
>> They get notified of being guilty for causing a gpu reset; three strikes
>> and they are out (banned from using the gpu) using the current rules.
>> This is a very blunt hammer that requires the rest of the system to be
>> robust; one might argue time spent making the system robust would be
>> better served making sure that the timer never expired in the first place
>> thereby eliminating the need for a forced gpu reset.
>> -Chris
> 
> Yes, for simplification the policy applied to force preempted contexts
> is the same as for hanging contexts. It is known that this feature
> should not be required in a fully curated system. It's a requirement
> if the end user will be allowed to install 3rd party apps to run in the
> non-critical domain.

My concern is whether it is safe to call this force _preemption_, while it
is not really expected to work as preemption from the point of view of
the preempted context. I may be missing some angle here, but I think a 
better name would include words like maximum request duration or something.

I can see a difference between allowed maximum duration when there is 
something else pending, and when it isn't, but I don't immediately see 
that we should consider this distinction for any real benefit?

So should the feature just be "maximum request duration"? This would 
perhaps make it just a special case of hangcheck, which ignores head 
progress, or whatever we do in there.
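
Roughly, something like the sketch below is what I have in mind. It is
purely illustrative (the names and the budget are invented, not code
from this series) and only meant to show that, unlike ordinary
hangcheck, head progress would not be consulted at all:

#include <stdbool.h>

/* Hypothetical sketch of "maximum request duration" as an extra hangcheck
 * tier; the names and the budget are invented, not code from this series. */

#define MAX_REQUEST_DURATION_MS 500UL   /* illustrative budget */

struct running_request {
        unsigned long start_ms;   /* when the request went active on the engine */
};

/* Would run from a periodic hangcheck-style timer.  Unlike classic
 * hangcheck, head progress is not consulted at all: exceeding the
 * budget alone is grounds for declaring the request hung. */
static bool request_over_budget(const struct running_request *rq,
                                unsigned long now_ms)
{
        return now_ms - rq->start_ms > MAX_REQUEST_DURATION_MS;
}

int main(void)
{
        struct running_request rq = { .start_ms = 0 };

        return request_over_budget(&rq, 600) ? 0 : 1;
}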

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-22 15:35       ` Chris Wilson
@ 2018-03-22 15:44         ` Jeff McGee
  0 siblings, 0 replies; 33+ messages in thread
From: Jeff McGee @ 2018-03-22 15:44 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, ben, kalyan.kondapally

On Thu, Mar 22, 2018 at 03:35:19PM +0000, Chris Wilson wrote:
> Quoting Jeff McGee (2018-03-22 14:34:58)
> > On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> > > Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> > > > 
> > > > On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> > > > > From: Jeff McGee <jeff.mcgee@intel.com>
> > > > > 
> > > > > Force preemption uses engine reset to enforce a limit on the time
> > > > > that a request targeted for preemption can block. This feature is
> > > > > a requirement in automotive systems where the GPU may be shared by
> > > > > clients of critically high priority and clients of low priority that
> > > > > may not have been curated to be preemption friendly. There may be
> > > > > more general applications of this feature. I'm sharing as an RFC to
> > > > > stimulate that discussion and also to get any technical feedback
> > > > > that I can before submitting to the product kernel that needs this.
> > > > > I have developed the patches for ease of rebase, given that this is
> > > > > for the moment considered a non-upstreamable feature. It would be
> > > > > possible to refactor hangcheck to fully incorporate force preemption
> > > > > as another tier of patience (or impatience) with the running request.
> > > > 
> > > > Sorry if it was mentioned elsewhere and I missed it - but does this work 
> > > > only with stateless clients - or in other words, what would happen to 
> > > > stateful clients which would be force preempted? Or the answer is we 
> > > > don't care since they are misbehaving?
> > > 
> > > They get notified of being guilty for causing a gpu reset; three strikes
> > > and they are out (banned from using the gpu) using the current rules.
> > > This is a very blunt hammer that requires the rest of the system to be
> > > robust; one might argue time spent making the system robust would be
> > > better served making sure that the timer never expired in the first place
> > > thereby eliminating the need for a forced gpu reset.
> > > -Chris
> > 
> > Yes, for simplification the policy applied to force preempted contexts
> > is the same as for hanging contexts. It is known that this feature
> > should not be required in a fully curated system. It's a requirement
> > if the end user will be allowed to install 3rd party apps to run in the
> > non-critical domain.
> 
> Third party code is still mediated by our userspace drivers, or are you
> contemplating scenarios where they talk directly to ioctls? How hostile a
> client do we have to contend with, i.e. survive a gpu fork bomb?
> -Chris

The user space driver has very little influence over the preemptibility of
the user space app. The most it can do is enable the finest granularity,
e.g. 3D object level. But an app can still submit a single giant triangle
with an uber pixel shader. Nothing overtly wrong with that, but it is not
mid-batch preemptible on our hardware.
-Jeff
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-22 14:34     ` Jeff McGee
@ 2018-03-22 15:35       ` Chris Wilson
  2018-03-22 15:44         ` Jeff McGee
  2018-03-22 15:57       ` Tvrtko Ursulin
  1 sibling, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2018-03-22 15:35 UTC (permalink / raw)
  To: Jeff McGee; +Cc: intel-gfx, ben, kalyan.kondapally

Quoting Jeff McGee (2018-03-22 14:34:58)
> On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> > > 
> > > On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> > > > From: Jeff McGee <jeff.mcgee@intel.com>
> > > > 
> > > > Force preemption uses engine reset to enforce a limit on the time
> > > > that a request targeted for preemption can block. This feature is
> > > > a requirement in automotive systems where the GPU may be shared by
> > > > clients of critically high priority and clients of low priority that
> > > > may not have been curated to be preemption friendly. There may be
> > > > more general applications of this feature. I'm sharing as an RFC to
> > > > stimulate that discussion and also to get any technical feedback
> > > > that I can before submitting to the product kernel that needs this.
> > > > I have developed the patches for ease of rebase, given that this is
> > > > for the moment considered a non-upstreamable feature. It would be
> > > > possible to refactor hangcheck to fully incorporate force preemption
> > > > as another tier of patience (or impatience) with the running request.
> > > 
> > > Sorry if it was mentioned elsewhere and I missed it - but does this work 
> > > only with stateless clients - or in other words, what would happen to 
> > > stateful clients which would be force preempted? Or the answer is we 
> > > don't care since they are misbehaving?
> > 
> > They get notified of being guilty for causing a gpu reset; three strikes
> > and they are out (banned from using the gpu) using the current rules.
> > This is a very blunt hammer that requires the rest of the system to be
> > robust; one might argue time spent making the system robust would be
> > better served making sure that the timer never expired in the first place
> > thereby eliminating the need for a forced gpu reset.
> > -Chris
> 
> Yes, for simplification the policy applied to force preempted contexts
> is the same as for hanging contexts. It is known that this feature
> should not be required in a fully curated system. It's a requirement
> if the end user will be allowed to install 3rd party apps to run in the
> non-critical domain.

Third party code is still mediated by our userspace drivers, or are you
contemplating scenarios where they talk directly to ioctls? How hostile a
client do we have to contend with, i.e. survive a gpu fork bomb?
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-22  9:28   ` Chris Wilson
@ 2018-03-22 14:34     ` Jeff McGee
  2018-03-22 15:35       ` Chris Wilson
  2018-03-22 15:57       ` Tvrtko Ursulin
  0 siblings, 2 replies; 33+ messages in thread
From: Jeff McGee @ 2018-03-22 14:34 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, ben, kalyan.kondapally

On Thu, Mar 22, 2018 at 09:28:00AM +0000, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> > 
> > On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> > > From: Jeff McGee <jeff.mcgee@intel.com>
> > > 
> > > Force preemption uses engine reset to enforce a limit on the time
> > > that a request targeted for preemption can block. This feature is
> > > a requirement in automotive systems where the GPU may be shared by
> > > clients of critically high priority and clients of low priority that
> > > may not have been curated to be preemption friendly. There may be
> > > more general applications of this feature. I'm sharing as an RFC to
> > > stimulate that discussion and also to get any technical feedback
> > > that I can before submitting to the product kernel that needs this.
> > > I have developed the patches for ease of rebase, given that this is
> > > for the moment considered a non-upstreamable feature. It would be
> > > possible to refactor hangcheck to fully incorporate force preemption
> > > as another tier of patience (or impatience) with the running request.
> > 
> > Sorry if it was mentioned elsewhere and I missed it - but does this work 
> > only with stateless clients - or in other words, what would happen to 
> > stateful clients which would be force preempted? Or the answer is we 
> > don't care since they are misbehaving?
> 
> They get notified of being guilty for causing a gpu reset; three strikes
> and they are out (banned from using the gpu) using the current rules.
> This is a very blunt hammer that requires the rest of the system to be
> robust; one might argue time spent making the system robust would be
> better served making sure that the timer never expired in the first place
> thereby eliminating the need for a forced gpu reset.
> -Chris

Yes, for simplification the policy applied to force preempted contexts
is the same as for hanging contexts. It is known that this feature
should not be required in a fully curated system. It's a requirement
if the end user will be allowed to install 3rd party apps to run in the
non-critical domain.
-Jeff
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-22  9:22 ` Tvrtko Ursulin
@ 2018-03-22  9:28   ` Chris Wilson
  2018-03-22 14:34     ` Jeff McGee
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2018-03-22  9:28 UTC (permalink / raw)
  To: Tvrtko Ursulin, jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting Tvrtko Ursulin (2018-03-22 09:22:55)
> 
> On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> > From: Jeff McGee <jeff.mcgee@intel.com>
> > 
> > Force preemption uses engine reset to enforce a limit on the time
> > that a request targeted for preemption can block. This feature is
> > a requirement in automotive systems where the GPU may be shared by
> > clients of critically high priority and clients of low priority that
> > may not have been curated to be preemption friendly. There may be
> > more general applications of this feature. I'm sharing as an RFC to
> > stimulate that discussion and also to get any technical feedback
> > that I can before submitting to the product kernel that needs this.
> > I have developed the patches for ease of rebase, given that this is
> > for the moment considered a non-upstreamable feature. It would be
> > possible to refactor hangcheck to fully incorporate force preemption
> > as another tier of patience (or impatience) with the running request.
> 
> Sorry if it was mentioned elsewhere and I missed it - but does this work 
> only with stateless clients - or in other words, what would happen to 
> stateful clients which would be force preempted? Or the answer is we 
> don't care since they are misbehaving?

They get notified of being guilty for causing a gpu reset; three strikes
and they are out (banned from using the gpu) using the current rules.
This is a very blunt hammer that requires the rest of the system to be
robust; one might argue the time spent making the system robust would be
better spent making sure that the timer never expires in the first place,
thereby eliminating the need for a forced gpu reset.
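
For illustration, here is a toy sketch of that three-strikes accounting;
the struct, the names and the threshold are invented for the sketch, not
the driver's actual fields:

#include <stdbool.h>
#include <stdio.h>

/* Invented for illustration; not the driver's actual bookkeeping. */

#define BAN_THRESHOLD 3   /* "three strikes and they are out" */

struct ctx_score {
        int guilty;    /* times blamed for a reset (hang or forced preemption) */
        bool banned;   /* once set, further submissions from the context are refused */
};

/* Called for the context found responsible for an engine reset. */
static void mark_guilty(struct ctx_score *ctx)
{
        if (++ctx->guilty >= BAN_THRESHOLD)
                ctx->banned = true;
}

int main(void)
{
        struct ctx_score ctx = { 0 };
        int i;

        for (i = 0; i < BAN_THRESHOLD; i++)
                mark_guilty(&ctx);

        printf("banned: %s\n", ctx.banned ? "yes" : "no");
        return 0;
}
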
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 0/8] Force preemption
  2018-03-21 17:26 jeff.mcgee
@ 2018-03-22  9:22 ` Tvrtko Ursulin
  2018-03-22  9:28   ` Chris Wilson
  0 siblings, 1 reply; 33+ messages in thread
From: Tvrtko Ursulin @ 2018-03-22  9:22 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben


On 21/03/2018 17:26, jeff.mcgee@intel.com wrote:
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> Force preemption uses engine reset to enforce a limit on the time
> that a request targeted for preemption can block. This feature is
> a requirement in automotive systems where the GPU may be shared by
> clients of critically high priority and clients of low priority that
> may not have been curated to be preemption friendly. There may be
> more general applications of this feature. I'm sharing as an RFC to
> stimulate that discussion and also to get any technical feedback
> that I can before submitting to the product kernel that needs this.
> I have developed the patches for ease of rebase, given that this is
> for the moment considered a non-upstreamable feature. It would be
> possible to refactor hangcheck to fully incorporate force preemption
> as another tier of patience (or impatience) with the running request.

Sorry if it was mentioned elsewhere and I missed it - but does this work 
only with stateless clients - or in other words, what would happen to 
stateful clients which would be force preempted? Or the answer is we 
don't care since they are misbehaving?

Maybe banning such contexts on the first force preemption would make sense for 
this specific target environment?

Regards,

Tvrtko

> 
> Chris Wilson (5):
>    drm/i915/execlists: Refactor out complete_preempt_context()
>    drm/i915: Add control flags to i915_handle_error()
>    drm/i915: Move engine reset prepare/finish to backends
>    drm/i915: Split execlists/guc reset prepartions
>    drm/i915/execlists: Flush pending preemption events during reset
> 
> Jeff McGee (3):
>    drm/i915: Fix loop on CSB processing
>    drm/i915: Skip CSB processing on invalid CSB tail
>    drm/i915: Force preemption to complete via engine reset
> 
>   drivers/gpu/drm/i915/i915_debugfs.c              |   4 +-
>   drivers/gpu/drm/i915/i915_drv.c                  |  17 +-
>   drivers/gpu/drm/i915/i915_drv.h                  |  10 +-
>   drivers/gpu/drm/i915/i915_gem.c                  |  69 ++--
>   drivers/gpu/drm/i915/i915_gpu_error.h            |   3 +
>   drivers/gpu/drm/i915/i915_irq.c                  |  55 +--
>   drivers/gpu/drm/i915/i915_params.c               |   3 +
>   drivers/gpu/drm/i915/i915_params.h               |   1 +
>   drivers/gpu/drm/i915/i915_request.c              |   2 +-
>   drivers/gpu/drm/i915/intel_engine_cs.c           |  40 +++
>   drivers/gpu/drm/i915/intel_guc_submission.c      |  39 ++
>   drivers/gpu/drm/i915/intel_hangcheck.c           |   8 +-
>   drivers/gpu/drm/i915/intel_lrc.c                 | 436 ++++++++++++++---------
>   drivers/gpu/drm/i915/intel_ringbuffer.c          |  20 +-
>   drivers/gpu/drm/i915/intel_ringbuffer.h          |  13 +-
>   drivers/gpu/drm/i915/selftests/intel_hangcheck.c |  13 +-
>   16 files changed, 469 insertions(+), 264 deletions(-)
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC 0/8] Force preemption
@ 2018-03-21 17:26 jeff.mcgee
  2018-03-22  9:22 ` Tvrtko Ursulin
  0 siblings, 1 reply; 33+ messages in thread
From: jeff.mcgee @ 2018-03-21 17:26 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

Force preemption uses engine reset to enforce a limit on the time
that a request targeted for preemption can block. This feature is
a requirement in automotive systems where the GPU may be shared by
clients of critically high priority and clients of low priority that
may not have been curated to be preemption friendly. There may be
more general applications of this feature. I'm sharing as an RFC to
stimulate that discussion and also to get any technical feedback
that I can before submitting to the product kernel that needs this.
I have developed the patches for ease of rebase, given that this is
for the moment considered a non-upstreamable feature. It would be
possible to refactor hangcheck to fully incorporate force preemption
as another tier of patience (or impatience) with the running request.

Chris Wilson (5):
  drm/i915/execlists: Refactor out complete_preempt_context()
  drm/i915: Add control flags to i915_handle_error()
  drm/i915: Move engine reset prepare/finish to backends
  drm/i915: Split execlists/guc reset prepartions
  drm/i915/execlists: Flush pending preemption events during reset

Jeff McGee (3):
  drm/i915: Fix loop on CSB processing
  drm/i915: Skip CSB processing on invalid CSB tail
  drm/i915: Force preemption to complete via engine reset

 drivers/gpu/drm/i915/i915_debugfs.c              |   4 +-
 drivers/gpu/drm/i915/i915_drv.c                  |  17 +-
 drivers/gpu/drm/i915/i915_drv.h                  |  10 +-
 drivers/gpu/drm/i915/i915_gem.c                  |  69 ++--
 drivers/gpu/drm/i915/i915_gpu_error.h            |   3 +
 drivers/gpu/drm/i915/i915_irq.c                  |  55 +--
 drivers/gpu/drm/i915/i915_params.c               |   3 +
 drivers/gpu/drm/i915/i915_params.h               |   1 +
 drivers/gpu/drm/i915/i915_request.c              |   2 +-
 drivers/gpu/drm/i915/intel_engine_cs.c           |  40 +++
 drivers/gpu/drm/i915/intel_guc_submission.c      |  39 ++
 drivers/gpu/drm/i915/intel_hangcheck.c           |   8 +-
 drivers/gpu/drm/i915/intel_lrc.c                 | 436 ++++++++++++++---------
 drivers/gpu/drm/i915/intel_ringbuffer.c          |  20 +-
 drivers/gpu/drm/i915/intel_ringbuffer.h          |  13 +-
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c |  13 +-
 16 files changed, 469 insertions(+), 264 deletions(-)

-- 
2.16.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2018-03-23 13:37 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
2018-03-16 18:30 ` [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed jeff.mcgee
2018-03-16 20:30   ` Chris Wilson
2018-03-16 18:30 ` [RFC 2/8] drm/i915: Skip CSB processing on invalid CSB tail jeff.mcgee
2018-03-16 20:30   ` Chris Wilson
2018-03-16 18:31 ` [RFC 3/8] drm/i915: Execlists to mark the HWSP upon preemption finished jeff.mcgee
2018-03-16 18:31 ` [RFC 4/8] drm/i915: Add a wait_for routine with more exact timeout jeff.mcgee
2018-03-16 18:31 ` [RFC 5/8] drm/i915: Consider preemption when finding the active request jeff.mcgee
2018-03-16 18:31 ` [RFC 6/8] drm/i915: Repair the preemption context if hit by reset jeff.mcgee
2018-03-16 18:31 ` [RFC 7/8] drm/i915: Allow reset without error capture jeff.mcgee
2018-03-16 20:33   ` Chris Wilson
2018-03-16 18:31 ` [RFC 8/8] drm/i915: Force preemption to complete via engine reset jeff.mcgee
2018-03-16 20:39   ` Chris Wilson
2018-03-16 20:44   ` Chris Wilson
2018-03-16 18:59 ` ✗ Fi.CI.BAT: failure for Force preemption Patchwork
2018-03-16 20:53 ` [RFC 0/8] " Chris Wilson
2018-03-16 21:03   ` Chris Wilson
2018-03-16 23:04     ` [PATCH] force-preempt-reset Chris Wilson
2018-03-16 23:45     ` ✗ Fi.CI.BAT: failure for force-preempt-reset Patchwork
2018-03-16 22:34 ` [RFC 0/8] Force preemption Chris Wilson
2018-03-21 17:26 jeff.mcgee
2018-03-22  9:22 ` Tvrtko Ursulin
2018-03-22  9:28   ` Chris Wilson
2018-03-22 14:34     ` Jeff McGee
2018-03-22 15:35       ` Chris Wilson
2018-03-22 15:44         ` Jeff McGee
2018-03-22 15:57       ` Tvrtko Ursulin
2018-03-22 16:01         ` Jeff McGee
2018-03-22 17:41           ` Tvrtko Ursulin
2018-03-22 19:08             ` Jeff McGee
2018-03-22 19:59               ` Bloomfield, Jon
2018-03-23 13:20                 ` Joonas Lahtinen
2018-03-23 13:37                   ` Chris Wilson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.