* [RFC 0/8] Force preemption
@ 2018-03-16 18:30 jeff.mcgee
  2018-03-16 18:30 ` [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed jeff.mcgee
                   ` (10 more replies)
  0 siblings, 11 replies; 20+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:30 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

Force preemption uses engine reset to enforce a limit on the time
that a request targeted for preemption can block. This feature is
a requirement in automotive systems where the GPU may be shared by
clients of critically high priority and clients of low priority that
may not have been curated to be preemption friendly. There may be
more general applications of this feature. I'm sharing this as an RFC
to stimulate that discussion and to get any technical feedback I can
before submitting to the product kernel that needs this.
I have developed the patches for ease of rebase, given that this is
for the moment considered a non-upstreamable feature. It would be
possible to refactor hangcheck to fully incorporate force preemption
as another tier of patience (or impatience) with the running request.

Jeff McGee (8):
  drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed
  drm/i915: Skip CSB processing on invalid CSB tail
  drm/i915: Execlists to mark the HWSP upon preemption finished
  drm/i915: Add a wait_for routine with more exact timeout
  drm/i915: Consider preemption when finding the active request
  drm/i915: Repair the preemption context if hit by reset
  drm/i915: Allow reset without error capture
  drm/i915: Force preemption to complete via engine reset

 drivers/gpu/drm/i915/i915_drv.c         | 14 +++++-
 drivers/gpu/drm/i915/i915_drv.h         |  4 ++
 drivers/gpu/drm/i915/i915_gem.c         | 86 ++++++++++++++++++++++++++++++---
 drivers/gpu/drm/i915/i915_irq.c         | 75 ++++++++++++++++------------
 drivers/gpu/drm/i915/i915_params.c      |  3 ++
 drivers/gpu/drm/i915/i915_params.h      |  1 +
 drivers/gpu/drm/i915/intel_drv.h        | 22 +++++++++
 drivers/gpu/drm/i915/intel_engine_cs.c  | 53 ++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.c        | 81 +++++++++++++++++++++++++++++--
 drivers/gpu/drm/i915/intel_ringbuffer.h | 20 +++++++-
 10 files changed, 314 insertions(+), 45 deletions(-)

-- 
2.16.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

* [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
@ 2018-03-16 18:30 ` jeff.mcgee
  2018-03-16 20:30   ` Chris Wilson
  2018-03-16 18:30 ` [RFC 2/8] drm/i915: Skip CSB processing on invalid CSB tail jeff.mcgee
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 20+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:30 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

It is possible for the hardware to be reset just as a context is
completing. The reset post-processing may see the seqno update and
assume that the context escaped corruption, but the context may
have been disrupted during the save-out process. The corruption may
mangle the HEAD and TAIL pointers such that the next submission
of the context switches out without running the intended request.
The GEM_BUG_ON would be hit in this situation even though it is not
really a driver error, so downgrade it to a GEM_WARN_ON: we still
get notice, while letting hangcheck detect and clean up.

This patch is required to support the force preemption feature.

Test: Run IGT gem_exec_fpreempt repeatedly with
      CONFIG_DRM_I915_DEBUG_GEM.
Change-Id: I87da4b16bad805fe48153a9ed9169900681ebba7
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index b44861459d24..7d93fcd56d34 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -920,7 +920,7 @@ static void intel_lrc_irq_handler(unsigned long data)
 			GEM_BUG_ON(count == 0);
 			if (--count == 0) {
 				GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
-				GEM_BUG_ON(!i915_gem_request_completed(rq));
+				GEM_WARN_ON(!i915_gem_request_completed(rq));
 				execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_OUT);
 
 				trace_i915_gem_request_out(rq);
-- 
2.16.2

* [RFC 2/8] drm/i915: Skip CSB processing on invalid CSB tail
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
  2018-03-16 18:30 ` [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed jeff.mcgee
@ 2018-03-16 18:30 ` jeff.mcgee
  2018-03-16 20:30   ` Chris Wilson
  2018-03-16 18:31 ` [RFC 3/8] drm/i915: Execlists to mark the HWSP upon preemption finished jeff.mcgee
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 20+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:30 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

Engine reset is fast. A context switch interrupt may be generated
just prior to the reset, such that the top-half handler races with
reset post-processing. The handler may set the irq_posted bit again
after the reset code has cleared it to start fresh. Then the
re-enabled tasklet will read the CSB head and tail from MMIO, which
will be at the hardware reset values of 0 and 7 respectively, given
that no actual CSB event has occurred since the reset. Mayhem then
ensues as the tasklet starts processing invalid CSB entries.

We can handle this corner case without adding any new synchronization
between the irq top half and the reset work item. The tasklet can
just skip CSB processing if the tail is not sane.

This patch is required to support the force preemption feature.

Test: Run IGT gem_exec_fpreempt repeatedly.
Change-Id: Ic7eb00600480bd62331c397dd92b747d946241e4
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 7d93fcd56d34..5f63d1d6a2d6 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -851,6 +851,14 @@ static void intel_lrc_irq_handler(unsigned long data)
 			head = readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)));
 			tail = GEN8_CSB_WRITE_PTR(head);
 			head = GEN8_CSB_READ_PTR(head);
+
+			/* The MMIO read CSB tail may be at the reset value of
+			 * 0x7 if there hasn't been a valid CSB event since
+			 * the engine reset. Skip on to dequeue if so.
+			 */
+			if (tail >= GEN8_CSB_ENTRIES)
+				break;
+
 			execlists->csb_head = head;
 		} else {
 			const int write_idx =
@@ -859,6 +867,7 @@ static void intel_lrc_irq_handler(unsigned long data)
 
 			head = execlists->csb_head;
 			tail = READ_ONCE(buf[write_idx]);
+			GEM_BUG_ON(tail >= GEN8_CSB_ENTRIES);
 		}
 
 		while (head != tail) {
-- 
2.16.2

* [RFC 3/8] drm/i915: Execlists to mark the HWSP upon preemption finished
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
  2018-03-16 18:30 ` [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed jeff.mcgee
  2018-03-16 18:30 ` [RFC 2/8] drm/i915: Skip CSB processing on invalid CSB tail jeff.mcgee
@ 2018-03-16 18:31 ` jeff.mcgee
  2018-03-16 18:31 ` [RFC 4/8] drm/i915: Add a wait_for routine with more exact timeout jeff.mcgee
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

In the next patch we are improving how we find the active request
on an engine with a pending preemption. We need to know if the
preemption has completed or not. This determination can be made
most robustly by having the preemption context write a
preemption-finished indicator to the hardware status page.

This patch is required to support the force preemption feature.

Change-Id: I4123283aec02e21d13c7cb55f329cf3f553b5d2c
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c        | 27 ++++++++++++++++++++++++---
 drivers/gpu/drm/i915/intel_ringbuffer.h | 14 +++++++++++++-
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 5f63d1d6a2d6..b2f838c484b0 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -512,19 +512,37 @@ static void port_assign(struct execlist_port *port,
 	port_set(port, port_pack(i915_gem_request_get(rq), port_count(port)));
 }
 
+#define PREEMPT_BREADCRUMB_DWORDS 0x8
 static void inject_preempt_context(struct intel_engine_cs *engine)
 {
 	struct intel_context *ce =
 		&engine->i915->preempt_context->engine[engine->id];
+	u32 *cs = ce->ring->vaddr + ce->ring->tail;
 	u32 __iomem *elsp =
 		engine->i915->regs + i915_mmio_reg_offset(RING_ELSP(engine));
 	unsigned int n;
 
 	GEM_BUG_ON(engine->i915->preempt_context->hw_id != PREEMPT_ID);
-	GEM_BUG_ON(!IS_ALIGNED(ce->ring->size, WA_TAIL_BYTES));
+	GEM_BUG_ON(intel_engine_preempt_finished(engine));
 
-	memset(ce->ring->vaddr + ce->ring->tail, 0, WA_TAIL_BYTES);
-	ce->ring->tail += WA_TAIL_BYTES;
+	if (engine->id == RCS) {
+		cs = gen8_emit_ggtt_write_rcs(cs, I915_GEM_HWS_PREEMPT_FINISHED,
+				intel_hws_preempt_done_address(engine));
+	} else {
+		cs = gen8_emit_ggtt_write(cs, I915_GEM_HWS_PREEMPT_FINISHED,
+				intel_hws_preempt_done_address(engine));
+		*cs++ = MI_NOOP;
+		*cs++ = MI_NOOP;
+	}
+	*cs++ = MI_NOOP;
+	*cs++ = MI_NOOP;
+
+	GEM_BUG_ON(!IS_ALIGNED(ce->ring->size,
+			       PREEMPT_BREADCRUMB_DWORDS * sizeof(u32)));
+	GEM_BUG_ON((void *)cs - (ce->ring->vaddr + ce->ring->tail) !=
+		   PREEMPT_BREADCRUMB_DWORDS * sizeof(u32));
+
+	ce->ring->tail += PREEMPT_BREADCRUMB_DWORDS * sizeof(u32);
 	ce->ring->tail &= (ce->ring->size - 1);
 	ce->lrc_reg_state[CTX_RING_TAIL+1] = ce->ring->tail;
 
@@ -911,6 +929,8 @@ static void intel_lrc_irq_handler(unsigned long data)
 								EXECLISTS_ACTIVE_PREEMPT));
 				execlists_clear_active(execlists,
 						       EXECLISTS_ACTIVE_PREEMPT);
+				GEM_BUG_ON(!intel_engine_preempt_finished(engine));
+				intel_engine_clear_preempt(engine);
 				continue;
 			}
 
@@ -1507,6 +1527,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
 
 	execlists->csb_head = -1;
 	execlists->active = 0;
+	intel_engine_clear_preempt(engine);
 
 	/* After a GPU reset, we may have requests to replay */
 	if (!i915_modparams.enable_guc_submission && execlists->first)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 5b9117b3cba4..25eb23bc06eb 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -625,11 +625,12 @@ intel_write_status_page(struct intel_engine_cs *engine, int reg, u32 value)
 #define I915_GEM_HWS_INDEX_ADDR (I915_GEM_HWS_INDEX << MI_STORE_DWORD_INDEX_SHIFT)
 #define I915_GEM_HWS_PREEMPT_INDEX	0x32
 #define I915_GEM_HWS_PREEMPT_ADDR (I915_GEM_HWS_PREEMPT_INDEX << MI_STORE_DWORD_INDEX_SHIFT)
+#define I915_GEM_HWS_PREEMPT_FINISHED	1
 #define I915_GEM_HWS_PID_INDEX 0x40
 #define I915_GEM_HWS_PID_ADDR (I915_GEM_HWS_PID_INDEX << MI_STORE_DWORD_INDEX_SHIFT)
 #define I915_GEM_HWS_CID_INDEX 0x48
 #define I915_GEM_HWS_CID_ADDR (I915_GEM_HWS_CID_INDEX << MI_STORE_DWORD_INDEX_SHIFT)
-#define I915_GEM_HWS_SCRATCH_INDEX     0x50
+#define I915_GEM_HWS_SCRATCH_INDEX	0x58
 #define I915_GEM_HWS_SCRATCH_ADDR (I915_GEM_HWS_SCRATCH_INDEX << MI_STORE_DWORD_INDEX_SHIFT)
 
 #define I915_HWS_CSB_BUF0_INDEX		0x10
@@ -749,6 +750,17 @@ static inline u32 intel_engine_get_seqno(struct intel_engine_cs *engine)
 	return intel_read_status_page(engine, I915_GEM_HWS_INDEX);
 }
 
+static inline bool intel_engine_preempt_finished(struct intel_engine_cs *engine)
+{
+	return (intel_read_status_page(engine, I915_GEM_HWS_PREEMPT_INDEX) ==
+		I915_GEM_HWS_PREEMPT_FINISHED);
+}
+
+static inline void intel_engine_clear_preempt(struct intel_engine_cs *engine)
+{
+	intel_write_status_page(engine, I915_GEM_HWS_PREEMPT_INDEX, 0);
+}
+
 static inline u32 intel_engine_last_submit(struct intel_engine_cs *engine)
 {
 	/* We are only peeking at the tail of the submit queue (and not the
-- 
2.16.2

* [RFC 4/8] drm/i915: Add a wait_for routine with more exact timeout
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (2 preceding siblings ...)
  2018-03-16 18:31 ` [RFC 3/8] drm/i915: Execlists to mark the HWSP upon preemption finished jeff.mcgee
@ 2018-03-16 18:31 ` jeff.mcgee
  2018-03-16 18:31 ` [RFC 5/8] drm/i915: Consider preemption when finding the active request jeff.mcgee
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

The current non-atomic wait_for routines convert the provided usec
timeout into jiffies with upward rounding. This can increase the
actual timeout by several msecs, which is fine in most cases. In the
next patch we need the timeout to conform more exactly to what is
requested.

This patch is required to support the force preemption feature.

Change-Id: I20372ccd4db6609c42a6b41000ca7ff7700358fc
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_drv.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index fa800918f67d..7cb6619129c3 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -75,6 +75,28 @@
 						   (Wmax))
 #define wait_for(COND, MS)		_wait_for((COND), (MS) * 1000, 10, 1000)
 
+#define _wait_for_exact(COND, US, Wmin, Wmax) ({ \
+	u64 timeout__ = local_clock() + (US) * 1000;			\
+	long wait__ = (Wmin); /* recommended min for usleep is 10 us */	\
+	int ret__;							\
+	might_sleep();							\
+	for (;;) {							\
+		bool expired__ = (local_clock() > timeout__);		\
+		if (COND) {						\
+			ret__ = 0;					\
+			break;						\
+		}							\
+		if (expired__) {					\
+			ret__ = -ETIMEDOUT;				\
+			break;						\
+		}							\
+		usleep_range(wait__, wait__ * 2);			\
+		if (wait__ < (Wmax))					\
+			wait__ <<= 1;					\
+	}								\
+	ret__;								\
+})
+
 /* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
 #if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
 # define _WAIT_FOR_ATOMIC_CHECK(ATOMIC) WARN_ON_ONCE((ATOMIC) && !in_atomic())
-- 
2.16.2

* [RFC 5/8] drm/i915: Consider preemption when finding the active request
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (3 preceding siblings ...)
  2018-03-16 18:31 ` [RFC 4/8] drm/i915: Add a wait_for routine with more exact timeout jeff.mcgee
@ 2018-03-16 18:31 ` jeff.mcgee
  2018-03-16 18:31 ` [RFC 6/8] drm/i915: Repair the preemption context if hit by reset jeff.mcgee
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

The active request is found by scanning the engine timeline for the
request that follows the last completed request. That method is
accurate if there is no preemption in progress, because the engine
will certainly have started that request. If there is a preemption in
progress, it could have completed, leaving the engine idle. In this
case the request we identified with the above method is not active
and may not even have been started. We must check for this condition
to avoid fingering an innocent request during reset.

This patch is required to support the force preemption feature.

Test: Run IGT gem_exec_fpreempt repeatedly.
Change-Id: I63a9f64446e24d4ee36b4af32854699bda006ddd
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 49 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 45 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index e2961e3913b8..9780d9026ce6 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2739,9 +2739,9 @@ static void i915_gem_context_mark_innocent(struct i915_gem_context *ctx)
 }
 
 struct drm_i915_gem_request *
-i915_gem_find_active_request(struct intel_engine_cs *engine)
+i915_gem_find_pending_request(struct intel_engine_cs *engine)
 {
-	struct drm_i915_gem_request *request, *active = NULL;
+	struct drm_i915_gem_request *request, *pending = NULL;
 	unsigned long flags;
 
 	/* We are called by the error capture and reset at a random
@@ -2762,12 +2762,53 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
 		GEM_BUG_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
 				    &request->fence.flags));
 
-		active = request;
+		pending = request;
 		break;
 	}
 	spin_unlock_irqrestore(&engine->timeline->lock, flags);
 
-	return active;
+	return pending;
+}
+
+struct drm_i915_gem_request *
+i915_gem_find_active_request(struct intel_engine_cs *engine)
+{
+	struct intel_engine_execlists * const execlists = &engine->execlists;
+	struct drm_i915_gem_request *request;
+
+	/* The pending request is active if no preemption is in progress */
+	if (!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT))
+		return i915_gem_find_pending_request(engine);
+
+	/* Preemption has finished. Engine idle. */
+	if (intel_engine_preempt_finished(engine))
+		return NULL;
+
+	request = i915_gem_find_pending_request(engine);
+
+	/* Preemption has flushed all requests off the engine. Engine idle. */
+	if (!request)
+		return NULL;
+
+	/* The pending request likely is active and blocking the in-progress
+	 * preemption. But there is a race in our previous checks. The request
+	 * that was actually blocking preemption could have completed (a batch-
+	 * boundary preemption) such that the engine is idle and the pending
+	 * request we have identified was the next in line. We must wait for
+	 * at least as long as it would take for the preempt-to-idle context
+	 * to mark the preemption done to verify this. We use 500 usecs to
+	 * account for a worst case delay from the seqno write of the
+	 * completing request and the preempt finished write.
+	 */
+	if (!_wait_for_exact(intel_engine_preempt_finished(engine), 500, 10, 50))
+		return NULL;
+
+	/*
+	 * We didn't see the preemption done after a sufficient wait. Thus the
+	 * pending request we sampled above was in fact active and blocking
+	 * the preemption.
+	 */
+	return request;
 }
 
 static bool engine_stalled(struct intel_engine_cs *engine)
-- 
2.16.2

* [RFC 6/8] drm/i915: Repair the preemption context if hit by reset
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (4 preceding siblings ...)
  2018-03-16 18:31 ` [RFC 5/8] drm/i915: Consider preemption when finding the active request jeff.mcgee
@ 2018-03-16 18:31 ` jeff.mcgee
  2018-03-16 18:31 ` [RFC 7/8] drm/i915: Allow reset without error capture jeff.mcgee
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

It is possible for the preemption context to be active on an engine
when the engine is reset. The preemption context is not tracked via
the request mechanism, so the normal reset handling will not detect
this scenario and will not be able to fix up possible context
corruption. So add some extra logic to do this.

This patch is required to support the force preemption feature.

Test: Run IGT gem_exec_fpreempt repeatedly.
Change-Id: Ifd0f17726111f3b702dd900064b7f375bbb42808
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index b2f838c484b0..581483886153 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1619,6 +1619,25 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 
 	spin_unlock_irqrestore(&engine->timeline->lock, flags);
 
+	/* If a preemption was pending when the reset occurred, and no
+	 * active request was found when the reset completed, it is
+	 * possible that the preemption context was hit by the reset.
+	 * We must assume that the context is corrupted so repair it.
+	 */
+	if (execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT) &&
+	    !request) {
+		struct i915_gem_context *ctx = engine->i915->preempt_context;
+		ce = &ctx->engine[engine->id];
+
+		execlists_init_reg_state(ce->lrc_reg_state,
+					 ctx, engine, ce->ring);
+		ce->lrc_reg_state[CTX_RING_BUFFER_START+1] =
+			i915_ggtt_offset(ce->ring->vma);
+		ce->lrc_reg_state[CTX_RING_HEAD+1] = ce->ring->tail;
+
+		return;
+	}
+
 	/* If the request was innocent, we leave the request in the ELSP
 	 * and will try to replay it on restarting. The context image may
 	 * have been corrupted by the reset, in which case we may have
-- 
2.16.2

* [RFC 7/8] drm/i915: Allow reset without error capture
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (5 preceding siblings ...)
  2018-03-16 18:31 ` [RFC 6/8] drm/i915: Repair the preemption context if hit by reset jeff.mcgee
@ 2018-03-16 18:31 ` jeff.mcgee
  2018-03-16 20:33   ` Chris Wilson
  2018-03-16 18:31 ` [RFC 8/8] drm/i915: Force preemption to complete via engine reset jeff.mcgee
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 20+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

Pull the reset handling out of i915_handle_error() so that it can be
called by that function and directly by the upcoming force preemption
handler. This allows the force preemption handler to bypass the error
capture that i915_handle_error() does before getting on with the
reset. We do not want error capture for force preemption because it
adds significant latency (~10 msecs measured on APL).

This patch is required to support the force preemption feature.

Change-Id: I41b4fae1adc197f0e70cec47cb960a0d7fa55f48
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h |  2 ++
 drivers/gpu/drm/i915/i915_irq.c | 75 +++++++++++++++++++++++------------------
 2 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d8524357373e..ade09f97be5c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3245,6 +3245,8 @@ __printf(3, 4)
 void i915_handle_error(struct drm_i915_private *dev_priv,
 		       u32 engine_mask,
 		       const char *fmt, ...);
+void i915_handle_reset(struct drm_i915_private *dev_priv,
+		       u32 engine_mask);
 
 extern void intel_irq_init(struct drm_i915_private *dev_priv);
 extern void intel_irq_fini(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index a34f459f8ac1..ab5d4d40083d 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2673,41 +2673,17 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
 }
 
 /**
- * i915_handle_error - handle a gpu error
+ * i915_handle_reset - handle a gpu reset
  * @dev_priv: i915 device private
- * @engine_mask: mask representing engines that are hung
- * @fmt: Error message format string
+ * @engine_mask: mask representing engines that require reset
  *
- * Do some basic checking of register state at error time and
- * dump it to the syslog.  Also call i915_capture_error_state() to make
- * sure we get a record and make it available in debugfs.  Fire a uevent
- * so userspace knows something bad happened (should trigger collection
- * of a ring dump etc.).
+ * Executes reset on the given engines.
  */
-void i915_handle_error(struct drm_i915_private *dev_priv,
-		       u32 engine_mask,
-		       const char *fmt, ...)
+void i915_handle_reset(struct drm_i915_private *dev_priv,
+		       u32 engine_mask)
 {
 	struct intel_engine_cs *engine;
 	unsigned int tmp;
-	va_list args;
-	char error_msg[80];
-
-	va_start(args, fmt);
-	vscnprintf(error_msg, sizeof(error_msg), fmt, args);
-	va_end(args);
-
-	/*
-	 * In most cases it's guaranteed that we get here with an RPM
-	 * reference held, for example because there is a pending GPU
-	 * request that won't finish until the reset is done. This
-	 * isn't the case at least when we get here by doing a
-	 * simulated reset via debugfs, so get an RPM reference.
-	 */
-	intel_runtime_pm_get(dev_priv);
-
-	i915_capture_error_state(dev_priv, engine_mask, error_msg);
-	i915_clear_error_registers(dev_priv);
 
 	/*
 	 * Try engine reset when available. We fall back to full reset if
@@ -2731,14 +2707,14 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	}
 
 	if (!engine_mask)
-		goto out;
+		return;
 
 	/* Full reset needs the mutex, stop any other user trying to do so. */
 	if (test_and_set_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags)) {
 		wait_event(dev_priv->gpu_error.reset_queue,
 			   !test_bit(I915_RESET_BACKOFF,
 				     &dev_priv->gpu_error.flags));
-		goto out;
+		return;
 	}
 
 	/* Prevent any other reset-engine attempt. */
@@ -2759,8 +2735,43 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 
 	clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
 	wake_up_all(&dev_priv->gpu_error.reset_queue);
+}
+/**
+ * i915_handle_error - handle a gpu error
+ * @dev_priv: i915 device private
+ * @engine_mask: mask representing engines that are hung
+ * @fmt: Error message format string
+ *
+ * Do some basic checking of register state at error time and
+ * dump it to the syslog.  Also call i915_capture_error_state() to make
+ * sure we get a record and make it available in debugfs.  Fire a uevent
+ * so userspace knows something bad happened (should trigger collection
+ * of a ring dump etc.).
+ */
+void i915_handle_error(struct drm_i915_private *dev_priv,
+		       u32 engine_mask,
+		       const char *fmt, ...)
+{
+	va_list args;
+	char error_msg[80];
+
+	va_start(args, fmt);
+	vscnprintf(error_msg, sizeof(error_msg), fmt, args);
+	va_end(args);
+
+	/*
+	 * In most cases it's guaranteed that we get here with an RPM
+	 * reference held, for example because there is a pending GPU
+	 * request that won't finish until the reset is done. This
+	 * isn't the case at least when we get here by doing a
+	 * simulated reset via debugfs, so get an RPM reference.
+	 */
+	intel_runtime_pm_get(dev_priv);
+
+	i915_capture_error_state(dev_priv, engine_mask, error_msg);
+	i915_clear_error_registers(dev_priv);
+	i915_handle_reset(dev_priv, engine_mask);
 
-out:
 	intel_runtime_pm_put(dev_priv);
 }
 
-- 
2.16.2

* [RFC 8/8] drm/i915: Force preemption to complete via engine reset
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (6 preceding siblings ...)
  2018-03-16 18:31 ` [RFC 7/8] drm/i915: Allow reset without error capture jeff.mcgee
@ 2018-03-16 18:31 ` jeff.mcgee
  2018-03-16 20:39   ` Chris Wilson
  2018-03-16 20:44   ` Chris Wilson
  2018-03-16 18:59 ` ✗ Fi.CI.BAT: failure for Force preemption Patchwork
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 20+ messages in thread
From: jeff.mcgee @ 2018-03-16 18:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: ben, kalyan.kondapally

From: Jeff McGee <jeff.mcgee@intel.com>

The hardware can complete a requested preemption only at certain
points in execution. Thus an uncooperative request that avoids those
points can block a preemption indefinitely. Our only option to bound
the preemption latency is to trigger reset and recovery just as we
would if a request had hung the hardware. This is so-called force
preemption. This change adds that capability as an option for systems
with extremely strict scheduling latency requirements for their high
priority requests. This option must be used with great care. The
force-preempted requests will be discarded at the point of reset,
resulting in various degrees of disruption to the owning application,
up to and including a crash.

The option is enabled by specifying a non-zero value for new i915 module
parameter fpreempt_timeout. This value becomes the time in milliseconds
after initiation of preemption at which the reset is triggered if the
preemption has not completed normally.

Test: Run IGT gem_exec_fpreempt.
Change-Id: Iafd3609012621c57fa9e490dfeeac46ae541b5c2
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c         | 14 ++++++++-
 drivers/gpu/drm/i915/i915_drv.h         |  2 ++
 drivers/gpu/drm/i915/i915_gem.c         | 37 +++++++++++++++++++++--
 drivers/gpu/drm/i915/i915_params.c      |  3 ++
 drivers/gpu/drm/i915/i915_params.h      |  1 +
 drivers/gpu/drm/i915/intel_engine_cs.c  | 53 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.c        | 24 ++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  6 ++++
 8 files changed, 136 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index b627370d5a9c..3f2394d61ea2 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -811,8 +811,16 @@ static int i915_workqueues_init(struct drm_i915_private *dev_priv)
 	if (dev_priv->hotplug.dp_wq == NULL)
 		goto out_free_wq;
 
+	if (INTEL_INFO(dev_priv)->has_logical_ring_preemption) {
+		dev_priv->fpreempt_wq = alloc_ordered_workqueue("i915-fpreempt",
+								WQ_HIGHPRI);
+		if (dev_priv->fpreempt_wq == NULL)
+			goto out_free_dp_wq;
+	}
 	return 0;
 
+out_free_dp_wq:
+	destroy_workqueue(dev_priv->hotplug.dp_wq);
 out_free_wq:
 	destroy_workqueue(dev_priv->wq);
 out_err:
@@ -832,6 +840,8 @@ static void i915_engines_cleanup(struct drm_i915_private *i915)
 
 static void i915_workqueues_cleanup(struct drm_i915_private *dev_priv)
 {
+	if (INTEL_INFO(dev_priv)->has_logical_ring_preemption)
+		destroy_workqueue(dev_priv->fpreempt_wq);
 	destroy_workqueue(dev_priv->hotplug.dp_wq);
 	destroy_workqueue(dev_priv->wq);
 }
@@ -2007,7 +2017,9 @@ int i915_reset_engine(struct intel_engine_cs *engine, unsigned int flags)
 
 	if (!(flags & I915_RESET_QUIET)) {
 		dev_notice(engine->i915->drm.dev,
-			   "Resetting %s after gpu hang\n", engine->name);
+			   "Resetting %s %s\n", engine->name,
+			   engine->fpreempt_active ?
+			   "for force preemption" : "after gpu hang");
 	}
 	error->reset_engine_count[engine->id]++;
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ade09f97be5c..514e640d8406 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2290,6 +2290,8 @@ struct drm_i915_private {
 	 */
 	struct workqueue_struct *wq;
 
+	struct workqueue_struct *fpreempt_wq;
+
 	/* Display functions */
 	struct drm_i915_display_funcs display;
 
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9780d9026ce6..d556743c578a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2811,9 +2811,21 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
 	return request;
 }
 
-static bool engine_stalled(struct intel_engine_cs *engine)
+static bool engine_stalled(struct intel_engine_cs *engine,
+			   struct drm_i915_gem_request *request)
 {
 	if (!intel_vgpu_active(engine->i915)) {
+		if (engine->fpreempt_active) {
+			/* Pardon the request if it managed to complete or
+			 * preempt prior to the reset.
+			 */
+			if (i915_gem_request_completed(request) ||
+			    intel_engine_preempt_finished(engine))
+				return false;
+
+			return true;
+		}
+
 		if (!engine->hangcheck.stalled)
 			return false;
 
@@ -2858,6 +2870,13 @@ i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 	tasklet_kill(&engine->execlists.irq_tasklet);
 	tasklet_disable(&engine->execlists.irq_tasklet);
 
+	/* There may be a force preemption timer active on this engine but
+	 * not yet expired, i.e. not the reason we are about to reset this
+	 * engine. Cancel it. If force preemption timeout is the reason we
+	 * are resetting the engine, this call will have no effect.
+	 */
+	intel_engine_cancel_fpreempt(engine);
+
 	if (engine->irq_seqno_barrier)
 		engine->irq_seqno_barrier(engine);
 
@@ -2958,7 +2977,7 @@ i915_gem_reset_request(struct intel_engine_cs *engine,
 	 * subsequent hangs.
 	 */
 
-	if (engine_stalled(engine)) {
+	if (engine_stalled(engine, request)) {
 		i915_gem_context_mark_guilty(request->ctx);
 		skip_request(request);
 
@@ -2966,6 +2985,13 @@ i915_gem_reset_request(struct intel_engine_cs *engine,
 		if (i915_gem_context_is_banned(request->ctx))
 			engine_skip_context(request);
 	} else {
+		/* If the request that we just pardoned was the target of a
+		 * force preemption there is no possibility of the next
+		 * request in line having started.
+		 */
+		if (engine->fpreempt_active)
+			return NULL;
+
 		/*
 		 * Since this is not the hung engine, it may have advanced
 		 * since the hang declaration. Double check by refinding
@@ -3035,6 +3061,13 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
 
 void i915_gem_reset_finish_engine(struct intel_engine_cs *engine)
 {
+	/* Mark any active force preemption as complete then kick
+	 * the tasklet.
+	 */
+	engine->fpreempt_active = false;
+	if (engine->execlists.first)
+		tasklet_schedule(&engine->execlists.irq_tasklet);
+
 	tasklet_enable(&engine->execlists.irq_tasklet);
 	kthread_unpark(engine->breadcrumbs.signaler);
 }
diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 63751a1de74a..9a0deb6e3920 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -247,3 +247,6 @@ MODULE_PARM_DESC(enable_conformance_check, "To toggle the GVT guest conformance
 
 module_param_named(disable_gvt_fw_loading, i915_modparams.disable_gvt_fw_loading, bool, 0400);
 MODULE_PARM_DESC(disable_gvt_fw_loading, "Disable GVT-g fw loading.");
+
+i915_param_named(fpreempt_timeout, uint, 0600,
+	"Wait time in msecs before forcing a preemption with reset (0:never force [default])");
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index e8c2ba4cb1e6..65fbffa6333c 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -54,6 +54,7 @@
 	func(int, edp_vswing); \
 	func(int, reset); \
 	func(unsigned int, inject_load_failure); \
+	func(unsigned int, fpreempt_timeout); \
 	/* leave bools at the end to not create holes */ \
 	func(bool, alpha_support); \
 	func(bool, enable_cmd_parser); \
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index c9bf2347f7a4..af6cc2d0f7e9 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -411,6 +411,58 @@ static void intel_engine_init_execlist(struct intel_engine_cs *engine)
 	execlists->first = NULL;
 }
 
+void intel_engine_queue_fpreempt(struct intel_engine_cs *engine)
+{
+	unsigned int timeout = i915_modparams.fpreempt_timeout;
+
+	if (!timeout)
+		return;
+
+	GEM_BUG_ON(engine->fpreempt_active);
+	hrtimer_start(&engine->fpreempt_timer,
+		      ms_to_ktime(timeout), HRTIMER_MODE_REL);
+}
+
+bool intel_engine_cancel_fpreempt(struct intel_engine_cs *engine)
+{
+	hrtimer_cancel(&engine->fpreempt_timer);
+
+	return !engine->fpreempt_active;
+}
+
+static enum hrtimer_restart
+intel_engine_fpreempt_timer(struct hrtimer *hrtimer)
+{
+	struct intel_engine_cs *engine =
+		container_of(hrtimer, struct intel_engine_cs,
+			     fpreempt_timer);
+
+	engine->fpreempt_active = true;
+	queue_work(engine->i915->fpreempt_wq, &engine->fpreempt_work);
+
+	return HRTIMER_NORESTART;
+}
+
+static void intel_engine_fpreempt_work(struct work_struct *work)
+{
+	struct intel_engine_cs *engine =
+		container_of(work, struct intel_engine_cs,
+			     fpreempt_work);
+
+	i915_handle_reset(engine->i915, intel_engine_flag(engine));
+}
+
+static void intel_engine_init_fpreempt(struct intel_engine_cs *engine)
+{
+	if (!INTEL_INFO(engine->i915)->has_logical_ring_preemption)
+		return;
+
+	hrtimer_init(&engine->fpreempt_timer,
+		     CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	engine->fpreempt_timer.function = intel_engine_fpreempt_timer;
+	INIT_WORK(&engine->fpreempt_work, intel_engine_fpreempt_work);
+}
+
 /**
  * intel_engines_setup_common - setup engine state not requiring hw access
  * @engine: Engine to setup.
@@ -426,6 +478,7 @@ void intel_engine_setup_common(struct intel_engine_cs *engine)
 
 	intel_engine_init_timeline(engine);
 	intel_engine_init_hangcheck(engine);
+	intel_engine_init_fpreempt(engine);
 	i915_gem_batch_pool_init(engine, &engine->batch_pool);
 
 	intel_engine_init_cmd_parser(engine);
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 581483886153..17487f8e8b4c 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -628,6 +628,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 			inject_preempt_context(engine);
 			execlists_set_active(execlists,
 					     EXECLISTS_ACTIVE_PREEMPT);
+			intel_engine_queue_fpreempt(engine);
 			goto unlock;
 		} else {
 			/*
@@ -846,6 +847,7 @@ static void intel_lrc_irq_handler(unsigned long data)
 		const u32 *buf =
 			&engine->status_page.page_addr[I915_HWS_CSB_BUF0_INDEX];
 		unsigned int head, tail;
+		bool defer = false;
 
 		/* However GVT emulation depends upon intercepting CSB mmio */
 		if (unlikely(execlists->csb_use_mmio)) {
@@ -919,6 +921,21 @@ static void intel_lrc_irq_handler(unsigned long data)
 
 			if (status & GEN8_CTX_STATUS_ACTIVE_IDLE &&
 			    buf[2*head + 1] == PREEMPT_ID) {
+				/* Try to cancel any pending force preemption.
+				 * If we are too late, hold off on processing
+				 * the completed preemption until reset has
+				 * run its course. It should recognize that
+				 * the engine has preempted to idle then abort
+				 * the reset. Then we can resume processing
+				 * at this CSB head.
+				 */
+				if (!intel_engine_cancel_fpreempt(engine)) {
+					if (!head--)
+						head = GEN8_CSB_ENTRIES - 1;
+					defer = true;
+					break;
+				}
+
 				execlist_cancel_port_requests(execlists);
 
 				spin_lock_irq(&engine->timeline->lock);
@@ -973,11 +990,16 @@ static void intel_lrc_irq_handler(unsigned long data)
 			writel(_MASKED_FIELD(GEN8_CSB_READ_PTR_MASK, head << 8),
 			       dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)));
 		}
+
+		if (defer) {
+			__set_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
+			goto out;
+		}
 	}
 
 	if (!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT))
 		execlists_dequeue(engine);
-
+out:
 	intel_uncore_forcewake_put(dev_priv, execlists->fw_domains);
 }
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 25eb23bc06eb..e259f954cdea 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -506,6 +506,10 @@ struct intel_engine_cs {
 
 	struct intel_engine_hangcheck hangcheck;
 
+	struct hrtimer fpreempt_timer;
+	struct work_struct fpreempt_work;
+	bool fpreempt_active;
+
 	bool needs_cmd_parser;
 
 	/*
@@ -932,4 +936,6 @@ void intel_engines_reset_default_submission(struct drm_i915_private *i915);
 
 bool intel_engine_can_store_dword(struct intel_engine_cs *engine);
 
+void intel_engine_queue_fpreempt(struct intel_engine_cs *engine);
+bool intel_engine_cancel_fpreempt(struct intel_engine_cs *engine);
 #endif /* _INTEL_RINGBUFFER_H_ */
-- 
2.16.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* ✗ Fi.CI.BAT: failure for Force preemption
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (7 preceding siblings ...)
  2018-03-16 18:31 ` [RFC 8/8] drm/i915: Force preemption to complete via engine reset jeff.mcgee
@ 2018-03-16 18:59 ` Patchwork
  2018-03-16 20:53 ` [RFC 0/8] " Chris Wilson
  2018-03-16 22:34 ` [RFC 0/8] Force preemption Chris Wilson
  10 siblings, 0 replies; 20+ messages in thread
From: Patchwork @ 2018-03-16 18:59 UTC (permalink / raw)
  To: jeff.mcgee; +Cc: intel-gfx

== Series Details ==

Series: Force preemption
URL   : https://patchwork.freedesktop.org/series/40120/
State : failure

== Summary ==

Applying: drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed
error: sha1 information is lacking or useless (drivers/gpu/drm/i915/intel_lrc.c).
error: could not build fake ancestor
Patch failed at 0001 drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed
The copy of the patch that failed is found in: .git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


* Re: [RFC 2/8] drm/i915: Skip CSB processing on invalid CSB tail
  2018-03-16 18:30 ` [RFC 2/8] drm/i915: Skip CSB processing on invalid CSB tail jeff.mcgee
@ 2018-03-16 20:30   ` Chris Wilson
  0 siblings, 0 replies; 20+ messages in thread
From: Chris Wilson @ 2018-03-16 20:30 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:30:59)
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> Engine reset is fast. A context switch interrupt may be generated just
> prior to the reset such that the top half handler is racing with reset
> post-processing. The handler may set the irq_posted bit again after
> the reset code has cleared it to start fresh. Then the re-enabled
> tasklet will read the CSB head and tail from MMIO, which will be at
> the hardware reset values of 0 and 7 respectively, given that no
> actual CSB event has occurred since the reset. Mayhem then ensues as
> the tasklet starts processing invalid CSB entries.
> 
> We can handle this corner case without adding any new synchronization
> between the irq top half and the reset work item. The tasklet can
> just skip CSB processing if the tail is not sane.
> 
> This patch is required to support the force preemption feature.

Please note how this is handled currently, because this is not a new
problem.
-Chris

* Re: [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed
  2018-03-16 18:30 ` [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed jeff.mcgee
@ 2018-03-16 20:30   ` Chris Wilson
  0 siblings, 0 replies; 20+ messages in thread
From: Chris Wilson @ 2018-03-16 20:30 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:30:58)
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> It is possible for the hardware to be reset just as a context is
> completing. The reset post-processing may see the seqno update and
> assume that the context escaped corruption, but the context may
> have been disrupted in the save out process. The corruption may
> screw up the HEAD and TAIL pointers such that the next submission
> of the context switches out without running the intended request.
> The GEM_BUG_ON will be hit in this situation, but it is not really
> a driver error. So make it a GEM_WARN_ON so that we notice while
> letting hangcheck detect and clean up.
> 
> This patch is required to support the force preemption feature.
> 
> Test: Run IGT gem_exec_fpreempt repeatedly with
>       CONFIG_DRM_I915_DEBUG_GEM.
> Change-Id: I87da4b16bad805fe48153a9ed9169900681ebba7
> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index b44861459d24..7d93fcd56d34 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -920,7 +920,7 @@ static void intel_lrc_irq_handler(unsigned long data)
>                         GEM_BUG_ON(count == 0);
>                         if (--count == 0) {
>                                 GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
> -                               GEM_BUG_ON(!i915_gem_request_completed(rq));
> +                               GEM_WARN_ON(!i915_gem_request_completed(rq));

Nope, either way you broke pretty badly.
-Chris

* Re: [RFC 7/8] drm/i915: Allow reset without error capture
  2018-03-16 18:31 ` [RFC 7/8] drm/i915: Allow reset without error capture jeff.mcgee
@ 2018-03-16 20:33   ` Chris Wilson
  0 siblings, 0 replies; 20+ messages in thread
From: Chris Wilson @ 2018-03-16 20:33 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:31:04)
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> Pull the reset handling out of i915_handle_error() so that it can be
> called by that function and directly by the upcoming force preemption
> handler. This allows the force preemption handler to bypass the error
> capture that i915_handle_error() does before getting on with the
> reset. We do not want error capture for force preemption because it
> adds significant latency (~10 msecs measured on APL).

And we don't need to capture the error in many other cases; just add a
control flag.
-Chris

* Re: [RFC 8/8] drm/i915: Force preemption to complete via engine reset
  2018-03-16 18:31 ` [RFC 8/8] drm/i915: Force preemption to complete via engine reset jeff.mcgee
@ 2018-03-16 20:39   ` Chris Wilson
  2018-03-16 20:44   ` Chris Wilson
  1 sibling, 0 replies; 20+ messages in thread
From: Chris Wilson @ 2018-03-16 20:39 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:31:05)
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> The hardware can complete the requested preemption at only certain points
> in execution. Thus an uncooperative request that avoids those points can
> block a preemption indefinitely. Our only option to bound the preemption
> latency is to trigger reset and recovery just as we would if a request had
> hung the hardware. This is so-called forced preemption. This change adds
> that capability as an option for systems with extremely strict scheduling
> latency requirements for their high-priority requests. This option must be
> used with great care. The force-preempted requests will be discarded at
> the point of reset, resulting in various degrees of disruption to the
> owning application, up to and including a crash.
> 
> The option is enabled by specifying a non-zero value for new i915 module
> parameter fpreempt_timeout. This value becomes the time in milliseconds
> after initiation of preemption at which the reset is triggered if the
> preemption has not completed normally.
> 
> Test: Run IGT gem_exec_fpreempt.
> Change-Id: Iafd3609012621c57fa9e490dfeeac46ae541b5c2
> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c         | 14 ++++++++-
>  drivers/gpu/drm/i915/i915_drv.h         |  2 ++
>  drivers/gpu/drm/i915/i915_gem.c         | 37 +++++++++++++++++++++--
>  drivers/gpu/drm/i915/i915_params.c      |  3 ++
>  drivers/gpu/drm/i915/i915_params.h      |  1 +
>  drivers/gpu/drm/i915/intel_engine_cs.c  | 53 +++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/intel_lrc.c        | 24 ++++++++++++++-
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  6 ++++
>  8 files changed, 136 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index b627370d5a9c..3f2394d61ea2 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -811,8 +811,16 @@ static int i915_workqueues_init(struct drm_i915_private *dev_priv)
>         if (dev_priv->hotplug.dp_wq == NULL)
>                 goto out_free_wq;
>  
> +       if (INTEL_INFO(dev_priv)->has_logical_ring_preemption) {
> +               dev_priv->fpreempt_wq = alloc_ordered_workqueue("i915-fpreempt",
> +                                                               WQ_HIGHPRI);
> +               if (dev_priv->fpreempt_wq == NULL)
> +                       goto out_free_dp_wq;

Doesn't require ordering; the resets, if either fails, are required to
fall back to a full reset.

> +       }
>         return 0;
>  
> +out_free_dp_wq:
> +       destroy_workqueue(dev_priv->hotplug.dp_wq);
>  out_free_wq:
>         destroy_workqueue(dev_priv->wq);
>  out_err:
> @@ -832,6 +840,8 @@ static void i915_engines_cleanup(struct drm_i915_private *i915)
>  
>  static void i915_workqueues_cleanup(struct drm_i915_private *dev_priv)
>  {
> +       if (INTEL_INFO(dev_priv)->has_logical_ring_preemption)
> +               destroy_workqueue(dev_priv->fpreempt_wq);
>         destroy_workqueue(dev_priv->hotplug.dp_wq);
>         destroy_workqueue(dev_priv->wq);
>  }
> @@ -2007,7 +2017,9 @@ int i915_reset_engine(struct intel_engine_cs *engine, unsigned int flags)
>  
>         if (!(flags & I915_RESET_QUIET)) {
>                 dev_notice(engine->i915->drm.dev,
> -                          "Resetting %s after gpu hang\n", engine->name);
> +                          "Resetting %s %s\n", engine->name,
> +                          engine->fpreempt_active ?
> +                          "for force preemption" : "after gpu hang");
>         }
>         error->reset_engine_count[engine->id]++;
>  
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index ade09f97be5c..514e640d8406 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2290,6 +2290,8 @@ struct drm_i915_private {
>          */
>         struct workqueue_struct *wq;
>  
> +       struct workqueue_struct *fpreempt_wq;
> +
>         /* Display functions */
>         struct drm_i915_display_funcs display;
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 9780d9026ce6..d556743c578a 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2811,9 +2811,21 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
>         return request;
>  }
>  
> -static bool engine_stalled(struct intel_engine_cs *engine)
> +static bool engine_stalled(struct intel_engine_cs *engine,
> +                          struct drm_i915_gem_request *request)
>  {
>         if (!intel_vgpu_active(engine->i915)) {
> +               if (engine->fpreempt_active) {
> +                       /* Pardon the request if it managed to complete or
> +                        * preempt prior to the reset.
> +                        */
> +                       if (i915_gem_request_completed(request) ||
> +                           intel_engine_preempt_finished(engine))
> +                               return false;
> +

No, once you pull the trigger just go through with it. You should never
pull the trigger if you are worried about the consequences.

> +                       return true;
> +               }
> +
>                 if (!engine->hangcheck.stalled)
>                         return false;
>  
> @@ -2858,6 +2870,13 @@ i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>         tasklet_kill(&engine->execlists.irq_tasklet);
>         tasklet_disable(&engine->execlists.irq_tasklet);
>  
> +       /* There may be a force preemption timer active on this engine but
> +        * not yet expired, i.e. not the reason we are about to reset this
> +        * engine. Cancel it. If force preemption timeout is the reason we
> +        * are resetting the engine, this call will have no effect.
> +        */

Done by just clearing execlists->active.

> +       intel_engine_cancel_fpreempt(engine);
> +
>         if (engine->irq_seqno_barrier)
>                 engine->irq_seqno_barrier(engine);
>  
> @@ -2958,7 +2977,7 @@ i915_gem_reset_request(struct intel_engine_cs *engine,
>          * subsequent hangs.
>          */
>  
> -       if (engine_stalled(engine)) {
> +       if (engine_stalled(engine, request)) {
>                 i915_gem_context_mark_guilty(request->ctx);
>                 skip_request(request);
>  
> @@ -2966,6 +2985,13 @@ i915_gem_reset_request(struct intel_engine_cs *engine,
>                 if (i915_gem_context_is_banned(request->ctx))
>                         engine_skip_context(request);
>         } else {
> +               /* If the request that we just pardoned was the target of a
> +                * force preemption there is no possibility of the next
> +                * request in line having started.
> +                */
> +               if (engine->fpreempt_active)
> +                       return NULL;
> +
>                 /*
>                  * Since this is not the hung engine, it may have advanced
>                  * since the hang declaration. Double check by refinding
> @@ -3035,6 +3061,13 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
>  
>  void i915_gem_reset_finish_engine(struct intel_engine_cs *engine)
>  {
> +       /* Mark any active force preemption as complete then kick
> +        * the tasklet.
> +        */
> +       engine->fpreempt_active = false;
> +       if (engine->execlists.first)
> +               tasklet_schedule(&engine->execlists.irq_tasklet);

Stray chunk.

> +
>         tasklet_enable(&engine->execlists.irq_tasklet);
>         kthread_unpark(engine->breadcrumbs.signaler);
>  }
> diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
> index 63751a1de74a..9a0deb6e3920 100644
> --- a/drivers/gpu/drm/i915/i915_params.c
> +++ b/drivers/gpu/drm/i915/i915_params.c
> @@ -247,3 +247,6 @@ MODULE_PARM_DESC(enable_conformance_check, "To toggle the GVT guest conformance
>  
>  module_param_named(disable_gvt_fw_loading, i915_modparams.disable_gvt_fw_loading, bool, 0400);
>  MODULE_PARM_DESC(disable_gvt_fw_loading, "Disable GVT-g fw loading.");
> +
> +i915_param_named(fpreempt_timeout, uint, 0600,
> +       "Wait time in msecs before forcing a preemption with reset (0:never force [default])");
> diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
> index e8c2ba4cb1e6..65fbffa6333c 100644
> --- a/drivers/gpu/drm/i915/i915_params.h
> +++ b/drivers/gpu/drm/i915/i915_params.h
> @@ -54,6 +54,7 @@
>         func(int, edp_vswing); \
>         func(int, reset); \
>         func(unsigned int, inject_load_failure); \
> +       func(unsigned int, fpreempt_timeout); \
>         /* leave bools at the end to not create holes */ \
>         func(bool, alpha_support); \
>         func(bool, enable_cmd_parser); \
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index c9bf2347f7a4..af6cc2d0f7e9 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -411,6 +411,58 @@ static void intel_engine_init_execlist(struct intel_engine_cs *engine)
>         execlists->first = NULL;
>  }
>  
> +void intel_engine_queue_fpreempt(struct intel_engine_cs *engine)
> +{
> +       unsigned int timeout = i915_modparams.fpreempt_timeout;
> +
> +       if (!timeout)
> +               return;
> +
> +       GEM_BUG_ON(engine->fpreempt_active);
> +       hrtimer_start(&engine->fpreempt_timer,
> +                     ms_to_ktime(timeout), HRTIMER_MODE_REL);
> +}
> +
> +bool intel_engine_cancel_fpreempt(struct intel_engine_cs *engine)
> +{
> +       hrtimer_cancel(&engine->fpreempt_timer);
> +
> +       return !engine->fpreempt_active;
> +}
> +
> +static enum hrtimer_restart
> +intel_engine_fpreempt_timer(struct hrtimer *hrtimer)
> +{
> +       struct intel_engine_cs *engine =
> +               container_of(hrtimer, struct intel_engine_cs,
> +                            fpreempt_timer);
> +
> +       engine->fpreempt_active = true;
> +       queue_work(engine->i915->fpreempt_wq, &engine->fpreempt_work);

Ah I see you didn't check.

> +
> +       return HRTIMER_NORESTART;
> +}
> +
> +static void intel_engine_fpreempt_work(struct work_struct *work)
> +{
> +       struct intel_engine_cs *engine =
> +               container_of(work, struct intel_engine_cs,
> +                            fpreempt_work);
> +
> +       i915_handle_reset(engine->i915, intel_engine_flag(engine));
> +}
> +
> +static void intel_engine_init_fpreempt(struct intel_engine_cs *engine)
> +{
> +       if (!INTEL_INFO(engine->i915)->has_logical_ring_preemption)
> +               return;
> +
> +       hrtimer_init(&engine->fpreempt_timer,
> +                    CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +       engine->fpreempt_timer.function = intel_engine_fpreempt_timer;
> +       INIT_WORK(&engine->fpreempt_work, intel_engine_fpreempt_work);

Just set the few pointers. It's not like we are saving anything.

> +}
> +
>  /**
>   * intel_engines_setup_common - setup engine state not requiring hw access
>   * @engine: Engine to setup.
> @@ -426,6 +478,7 @@ void intel_engine_setup_common(struct intel_engine_cs *engine)
>  
>         intel_engine_init_timeline(engine);
>         intel_engine_init_hangcheck(engine);
> +       intel_engine_init_fpreempt(engine);
>         i915_gem_batch_pool_init(engine, &engine->batch_pool);
>  
>         intel_engine_init_cmd_parser(engine);
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 581483886153..17487f8e8b4c 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -628,6 +628,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>                         inject_preempt_context(engine);
>                         execlists_set_active(execlists,
>                                              EXECLISTS_ACTIVE_PREEMPT);
> +                       intel_engine_queue_fpreempt(engine);
>                         goto unlock;
>                 } else {
>                         /*
> @@ -846,6 +847,7 @@ static void intel_lrc_irq_handler(unsigned long data)
>                 const u32 *buf =
>                         &engine->status_page.page_addr[I915_HWS_CSB_BUF0_INDEX];
>                 unsigned int head, tail;
> +               bool defer = false;
>  
>                 /* However GVT emulation depends upon intercepting CSB mmio */
>                 if (unlikely(execlists->csb_use_mmio)) {
> @@ -919,6 +921,21 @@ static void intel_lrc_irq_handler(unsigned long data)
>  
>                         if (status & GEN8_CTX_STATUS_ACTIVE_IDLE &&
>                             buf[2*head + 1] == PREEMPT_ID) {
> +                               /* Try to cancel any pending force preemption.
> +                                * If we are too late, hold off on processing
> +                                * the completed preemption until reset has
> +                                * run its course. It should recognize that
> +                                * the engine has preempted to idle then abort
> +                                * the reset. Then we can resume processing
> +                                * at this CSB head.
> +                                */

This can be much simpler than implied here.
-Chris

* Re: [RFC 8/8] drm/i915: Force preemption to complete via engine reset
  2018-03-16 18:31 ` [RFC 8/8] drm/i915: Force preemption to complete via engine reset jeff.mcgee
  2018-03-16 20:39   ` Chris Wilson
@ 2018-03-16 20:44   ` Chris Wilson
  1 sibling, 0 replies; 20+ messages in thread
From: Chris Wilson @ 2018-03-16 20:44 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:31:05)
> +bool intel_engine_cancel_fpreempt(struct intel_engine_cs *engine)
> +{
> +       hrtimer_cancel(&engine->fpreempt_timer);

Can not be used inside the tasklet as the hrtimer may also be a tasklet
on the same CPU; so busy spin deadlock.
-Chris
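
[Editor's note: a hedged sketch of the non-blocking alternative Chris is
hinting at. Kernel code, shown only to illustrate the API difference;
cancel_fpreempt_nonblocking is a hypothetical helper, not part of the
series.]

```c
/* hrtimer_cancel() busy-waits until the timer callback has finished,
 * which can deadlock when the caller runs in a context the callback
 * cannot preempt on the same CPU. hrtimer_try_to_cancel() never
 * waits: it returns 0 if the timer was not queued, 1 if it was
 * deactivated, and -1 if the callback is currently executing and
 * could not be stopped.
 */
static bool cancel_fpreempt_nonblocking(struct intel_engine_cs *engine)
{
	/* -1 means the force-preemption timer has already fired; treat
	 * the cancel as failed, exactly as the fpreempt_active check in
	 * the posted patch does.
	 */
	if (hrtimer_try_to_cancel(&engine->fpreempt_timer) < 0)
		return false;

	return !engine->fpreempt_active;
}
```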

* Re: [RFC 0/8] Force preemption
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (8 preceding siblings ...)
  2018-03-16 18:59 ` ✗ Fi.CI.BAT: failure for Force preemption Patchwork
@ 2018-03-16 20:53 ` Chris Wilson
  2018-03-16 21:03   ` Chris Wilson
  2018-03-16 22:34 ` [RFC 0/8] Force preemption Chris Wilson
  10 siblings, 1 reply; 20+ messages in thread
From: Chris Wilson @ 2018-03-16 20:53 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:30:57)
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> Force preemption uses engine reset to enforce a limit on the time
> that a request targeted for preemption can block. This feature is
> a requirement in automotive systems where the GPU may be shared by
> clients of critically high priority and clients of low priority that
> may not have been curated to be preemption friendly. There may be
> more general applications of this feature. I'm sharing as an RFC to
> stimulate that discussion and also to get any technical feedback
> that I can before submitting to the product kernel that needs this.
> I have developed the patches for ease of rebase, given that this is
> for the moment considered a non-upstreamable feature. It would be
> possible to refactor hangcheck to fully incorporate force preemption
> as another tier of patience (or impatience) with the running request.

Last night I spent 15 minutes and wrote a similar RFC, as Joonas kept
muttering about this and I thought he was looking for code; so here is some...

The only real difference is the approach to handling the pending timer.
I do not agree with your reset changes, as they appear to percolate even
more uncertainty through the code. If the preempt-context is in flight
when the reset is triggered, the last active request is still defined by
the incomplete request; the new request has not been submitted. This is
serialised by checking for EXECLISTS_ACTIVE_PREEMPT. Up until the point
where that is cleared and the next request submitted, we should reset the
engine and resubmit. The race in tasklet handling is immaterial at that
point; you have already paid the cost in using a timer, the workqueue
and serialising for the reset.

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 76ea1da923bd..d9da68efbc86 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -3969,8 +3969,8 @@ i915_wedged_set(void *data, u64 val)
                engine->hangcheck.stalled = true;
        }
 
-       i915_handle_error(i915, val, "Manually set wedged engine mask = %llx",
-                         val);
+       i915_handle_error(i915, val, I915_ERROR_CAPTURE,
+                         "Manually set wedged engine mask = %llx", val);
 
        wait_on_bit(&i915->gpu_error.flags,
                    I915_RESET_HANDOFF,
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 26c1dd4b542e..d817b5e55c96 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2755,10 +2755,12 @@ static inline void i915_queue_hangcheck(struct drm_i915_private *dev_priv)
                           &dev_priv->gpu_error.hangcheck_work, delay);
 }
 
-__printf(3, 4)
+__printf(4, 5)
 void i915_handle_error(struct drm_i915_private *dev_priv,
                       u32 engine_mask,
+                      unsigned long flags,
                       const char *fmt, ...);
+#define I915_ERROR_CAPTURE BIT(0)
 
 extern void intel_irq_init(struct drm_i915_private *dev_priv);
 extern void intel_irq_fini(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 5aaf356fd27f..0a255ffe4c51 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2593,6 +2593,7 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
  * i915_handle_error - handle a gpu error
  * @dev_priv: i915 device private
  * @engine_mask: mask representing engines that are hung
+ * @flags: control flags
  * @fmt: Error message format string
  *
  * Do some basic checking of register state at error time and
@@ -2603,16 +2604,11 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
  */
 void i915_handle_error(struct drm_i915_private *dev_priv,
                       u32 engine_mask,
+                      unsigned long flags,
                       const char *fmt, ...)
 {
        struct intel_engine_cs *engine;
        unsigned int tmp;
-       va_list args;
-       char error_msg[80];
-
-       va_start(args, fmt);
-       vscnprintf(error_msg, sizeof(error_msg), fmt, args);
-       va_end(args);
 
        /*
         * In most cases it's guaranteed that we get here with an RPM
@@ -2624,8 +2620,18 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
        intel_runtime_pm_get(dev_priv);
 
        engine_mask &= DEVICE_INFO(dev_priv)->ring_mask;
-       i915_capture_error_state(dev_priv, engine_mask, error_msg);
-       i915_clear_error_registers(dev_priv);
+
+       if (flags & I915_ERROR_CAPTURE) {
+               char error_msg[80];
+               va_list args;
+
+               va_start(args, fmt);
+               vscnprintf(error_msg, sizeof(error_msg), fmt, args);
+               va_end(args);
+
+               i915_capture_error_state(dev_priv, engine_mask, error_msg);
+               i915_clear_error_registers(dev_priv);
+       }
 
        /*
         * Try engine reset when available. We fall back to full reset if
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index 42e45ae87393..13d1a269c771 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -246,7 +246,7 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
         */
        tmp = I915_READ_CTL(engine);
        if (tmp & RING_WAIT) {
-               i915_handle_error(dev_priv, 0,
+               i915_handle_error(dev_priv, 0, 0,
                                  "Kicking stuck wait on %s",
                                  engine->name);
                I915_WRITE_CTL(engine, tmp);
@@ -258,7 +258,7 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
                default:
                        return ENGINE_DEAD;
                case 1:
-                       i915_handle_error(dev_priv, 0,
+                       i915_handle_error(dev_priv, 0, 0,
                                          "Kicking stuck semaphore on %s",
                                          engine->name);
                        I915_WRITE_CTL(engine, tmp);
@@ -392,7 +392,7 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
                                 "%s, ", engine->name);
        msg[len-2] = '\0';
 
-       return i915_handle_error(i915, hung, "%s", msg);
+       return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
 }
 
 /*
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 396abef55dd3..55f25241a199 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -529,8 +529,35 @@ static void inject_preempt_context(struct intel_engine_cs *engine)
        if (execlists->ctrl_reg)
                writel(EL_CTRL_LOAD, execlists->ctrl_reg);
 
-       execlists_clear_active(&engine->execlists, EXECLISTS_ACTIVE_HWACK);
-       execlists_set_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT);
+       execlists_clear_active(execlists, EXECLISTS_ACTIVE_HWACK);
+       execlists_set_active(execlists, EXECLISTS_ACTIVE_PREEMPT);
+
+       /* Set a timer to force preemption vs hostile userspace */
+       if (execlists->preempt_timeout_ns)
+               hrtimer_start(&execlists->preempt_timer,
+                             ktime_set(0, execlists->preempt_timeout_ns),
+                             HRTIMER_MODE_REL);
+}
+
+static enum hrtimer_restart preempt_timeout(struct hrtimer *hrtimer)
+{
+       struct intel_engine_cs *engine =
+               container_of(hrtimer, typeof(*engine), execlists.preempt_timer);
+
+       if (execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
+               queue_work(system_highpri_wq, &engine->execlists.preempt_reset);
+
+       return HRTIMER_NORESTART;
+}
+
+static void preempt_reset(struct work_struct *work)
+{
+       struct intel_engine_cs *engine =
+               container_of(work, typeof(*engine), execlists.preempt_reset);
+
+       if (execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
+               i915_handle_error(engine->i915, BIT(engine->id), 0,
+                                 "preemption timed out on %s", engine->name);
 }
 
 static void update_rps(struct intel_engine_cs *engine)
@@ -946,6 +973,7 @@ static void execlists_submission_tasklet(unsigned long data)
                                                                EXECLISTS_ACTIVE_PREEMPT));
                                execlists_clear_active(execlists,
                                                       EXECLISTS_ACTIVE_PREEMPT);
+                               hrtimer_try_to_cancel(&execlists->preempt_timer);
                                continue;
                        }
 
@@ -2180,6 +2208,11 @@ logical_ring_setup(struct intel_engine_cs *engine)
        tasklet_init(&engine->execlists.tasklet,
                     execlists_submission_tasklet, (unsigned long)engine);
 
+       INIT_WORK(&engine->execlists.preempt_reset, preempt_reset);
+       hrtimer_init(&engine->execlists.preempt_timer,
+                    CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+       engine->execlists.preempt_timer.function = preempt_timeout;
+
        logical_ring_default_vfuncs(engine);
        logical_ring_default_irqs(engine);
 }
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 495b21fc33db..74fcff8a2a6e 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -307,6 +307,10 @@ struct intel_engine_execlists {
         * @preempt_complete_status: expected CSB upon completing preemption
         */
        u32 preempt_complete_status;
+
+       struct hrtimer preempt_timer;
+       struct work_struct preempt_reset;
+       unsigned long preempt_timeout_ns;
 };
 
 #define INTEL_ENGINE_CS_MAX_NAME 8
diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index c61a51e46eb2..e99705068190 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -1091,7 +1091,7 @@ static int igt_handle_error(void *arg)
        engine->hangcheck.stalled = true;
        engine->hangcheck.seqno = intel_engine_get_seqno(engine);
 
-       i915_handle_error(i915, intel_engine_flag(engine), "%s", __func__);
+       i915_handle_error(i915, intel_engine_flag(engine), 0, "%s", __func__);
 
        xchg(&i915->gpu_error.first_error, error);




* Re: [RFC 0/8] Force preemption
  2018-03-16 20:53 ` [RFC 0/8] " Chris Wilson
@ 2018-03-16 21:03   ` Chris Wilson
  2018-03-16 23:04     ` [PATCH] force-preempt-reset Chris Wilson
  2018-03-16 23:45     ` ✗ Fi.CI.BAT: failure for force-preempt-reset Patchwork
  0 siblings, 2 replies; 20+ messages in thread
From: Chris Wilson @ 2018-03-16 21:03 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting Chris Wilson (2018-03-16 20:53:01)
> +static void preempt_reset(struct work_struct *work)
> +{
> +       struct intel_engine_cs *engine =
> +               container_of(work, typeof(*engine), execlists.preempt_reset);
> +

So thinking about the races you had in the reset, you need

tasklet_kill()
tasklet_disable();

> +       if (execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
> +               i915_handle_error(engine->i915, BIT(engine->id), 0,
> +                                 "preemption timed out on %s", engine->name);

tasklet_enable();

here for the serialisation on EXECLISTS_ACTIVE_PREEMPT.
-Chris


* Re: [RFC 0/8] Force preemption
  2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
                   ` (9 preceding siblings ...)
  2018-03-16 20:53 ` [RFC 0/8] " Chris Wilson
@ 2018-03-16 22:34 ` Chris Wilson
  10 siblings, 0 replies; 20+ messages in thread
From: Chris Wilson @ 2018-03-16 22:34 UTC (permalink / raw)
  To: jeff.mcgee, intel-gfx; +Cc: kalyan.kondapally, ben

Quoting jeff.mcgee@intel.com (2018-03-16 18:30:57)
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> Force preemption uses engine reset to enforce a limit on the time
> that a request targeted for preemption can block. This feature is
> a requirement in automotive systems where the GPU may be shared by
> clients of critically high priority and clients of low priority that
> may not have been curated to be preemption friendly. There may be
> more general applications of this feature. I'm sharing as an RFC to
> stimulate that discussion and also to get any technical feedback
> that I can before submitting to the product kernel that needs this.
> I have developed the patches for ease of rebase, given that this is
> for the moment considered a non-upstreamable feature. It would be
> possible to refactor hangcheck to fully incorporate force preemption
> as another tier of patience (or impatience) with the running request.

So one big problem I should have realised earlier is that ksoftirqd
screws up any latency-sensitive operation. I have an ugly idea for a
hack, but we really should cap the latency problems with tasklets.
-Chris


* [PATCH] force-preempt-reset
  2018-03-16 21:03   ` Chris Wilson
@ 2018-03-16 23:04     ` Chris Wilson
  2018-03-16 23:45     ` ✗ Fi.CI.BAT: failure for force-preempt-reset Patchwork
  1 sibling, 0 replies; 20+ messages in thread
From: Chris Wilson @ 2018-03-16 23:04 UTC (permalink / raw)
  To: intel-gfx

A more concrete idea of what I think the serialisation between the
reset timer and the tasklet should be.

---
 drivers/gpu/drm/i915/intel_lrc.c        | 43 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |  4 +++
 2 files changed, 47 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 2c2eb6d3868a..b6fd4323663d 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -531,6 +531,39 @@ static void inject_preempt_context(struct intel_engine_cs *engine)
 
 	execlists_clear_active(execlists, EXECLISTS_ACTIVE_HWACK);
 	execlists_set_active(execlists, EXECLISTS_ACTIVE_PREEMPT);
+
+	/* Set a timer to force preemption vs hostile userspace */
+	if (execlists->preempt_timeout_ns)
+		hrtimer_start(&execlists->preempt_timer,
+			      ktime_set(0, execlists->preempt_timeout_ns),
+			      HRTIMER_MODE_REL);
+}
+
+static enum hrtimer_restart preempt_timeout(struct hrtimer *hrtimer)
+{
+	struct intel_engine_cs *engine =
+		container_of(hrtimer, typeof(*engine), execlists.preempt_timer);
+
+	if (execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
+		queue_work(system_highpri_wq, &engine->execlists.preempt_reset);
+
+	return HRTIMER_NORESTART;
+}
+
+static void preempt_reset(struct work_struct *work)
+{
+	struct intel_engine_cs *engine =
+		container_of(work, typeof(*engine), execlists.preempt_reset);
+
+	tasklet_disable(&engine->execlists.tasklet);
+
+	engine->execlists.tasklet.func(engine->execlists.tasklet.data);
+
+	if (execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT))
+		i915_handle_error(engine->i915, BIT(engine->id), 0,
+				  "preemption timed out on %s", engine->name);
+
+	tasklet_enable(&engine->execlists.tasklet);
 }
 
 static void complete_preempt_context(struct intel_engine_execlists *execlists)
@@ -539,6 +572,11 @@ static void complete_preempt_context(struct intel_engine_execlists *execlists)
 	execlists_unwind_incomplete_requests(execlists);
 
 	GEM_BUG_ON(!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT));
+
+	/* If the timer already fired, complete the reset */
+	if (hrtimer_try_to_cancel(&execlists->preempt_timer) < 0)
+		return;
+
 	execlists_clear_active(execlists, EXECLISTS_ACTIVE_PREEMPT);
 }
 
@@ -2182,6 +2220,11 @@ logical_ring_setup(struct intel_engine_cs *engine)
 	tasklet_init(&engine->execlists.tasklet,
 		     execlists_submission_tasklet, (unsigned long)engine);
 
+	INIT_WORK(&engine->execlists.preempt_reset, preempt_reset);
+	hrtimer_init(&engine->execlists.preempt_timer,
+		     CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	engine->execlists.preempt_timer.function = preempt_timeout;
+
 	logical_ring_default_vfuncs(engine);
 	logical_ring_default_irqs(engine);
 }
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 495b21fc33db..74fcff8a2a6e 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -307,6 +307,10 @@ struct intel_engine_execlists {
 	 * @preempt_complete_status: expected CSB upon completing preemption
 	 */
 	u32 preempt_complete_status;
+
+	struct hrtimer preempt_timer;
+	struct work_struct preempt_reset;
+	unsigned long preempt_timeout_ns;
 };
 
 #define INTEL_ENGINE_CS_MAX_NAME 8
-- 
2.16.2



* ✗ Fi.CI.BAT: failure for force-preempt-reset
  2018-03-16 21:03   ` Chris Wilson
  2018-03-16 23:04     ` [PATCH] force-preempt-reset Chris Wilson
@ 2018-03-16 23:45     ` Patchwork
  1 sibling, 0 replies; 20+ messages in thread
From: Patchwork @ 2018-03-16 23:45 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: force-preempt-reset
URL   : https://patchwork.freedesktop.org/series/40136/
State : failure

== Summary ==

Applying: force-preempt-reset
error: sha1 information is lacking or useless (drivers/gpu/drm/i915/intel_lrc.c).
error: could not build fake ancestor
Patch failed at 0001 force-preempt-reset
The copy of the patch that failed is found in: .git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



end of thread, other threads:[~2018-03-16 23:45 UTC | newest]

Thread overview: 20+ messages
2018-03-16 18:30 [RFC 0/8] Force preemption jeff.mcgee
2018-03-16 18:30 ` [RFC 1/8] drm/i915: Downgrade tasklet GEM_BUG_ON for request not completed jeff.mcgee
2018-03-16 20:30   ` Chris Wilson
2018-03-16 18:30 ` [RFC 2/8] drm/i915: Skip CSB processing on invalid CSB tail jeff.mcgee
2018-03-16 20:30   ` Chris Wilson
2018-03-16 18:31 ` [RFC 3/8] drm/i915: Execlists to mark the HWSP upon preemption finished jeff.mcgee
2018-03-16 18:31 ` [RFC 4/8] drm/i915: Add a wait_for routine with more exact timeout jeff.mcgee
2018-03-16 18:31 ` [RFC 5/8] drm/i915: Consider preemption when finding the active request jeff.mcgee
2018-03-16 18:31 ` [RFC 6/8] drm/i915: Repair the preemption context if hit by reset jeff.mcgee
2018-03-16 18:31 ` [RFC 7/8] drm/i915: Allow reset without error capture jeff.mcgee
2018-03-16 20:33   ` Chris Wilson
2018-03-16 18:31 ` [RFC 8/8] drm/i915: Force preemption to complete via engine reset jeff.mcgee
2018-03-16 20:39   ` Chris Wilson
2018-03-16 20:44   ` Chris Wilson
2018-03-16 18:59 ` ✗ Fi.CI.BAT: failure for Force preemption Patchwork
2018-03-16 20:53 ` [RFC 0/8] " Chris Wilson
2018-03-16 21:03   ` Chris Wilson
2018-03-16 23:04     ` [PATCH] force-preempt-reset Chris Wilson
2018-03-16 23:45     ` ✗ Fi.CI.BAT: failure for force-preempt-reset Patchwork
2018-03-16 22:34 ` [RFC 0/8] Force preemption Chris Wilson
