* [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs
@ 2017-07-17  9:11 Chris Wilson
  2017-07-17  9:11 ` [PATCH 02/15] drm/i915: Reset context image on engines after triggering the reset Chris Wilson
                   ` (13 more replies)
  0 siblings, 14 replies; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

As part of knowing whether there is outstanding data in the CSB,
also check whether there is an outstanding IRQ notification.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 1130e3c83c88..5b09bd4d0483 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -3395,10 +3395,12 @@ static int i915_engine_info(struct seq_file *m, void *unused)
 			ptr = I915_READ(RING_CONTEXT_STATUS_PTR(engine));
 			read = GEN8_CSB_READ_PTR(ptr);
 			write = GEN8_CSB_WRITE_PTR(ptr);
-			seq_printf(m, "\tExeclist CSB read %d [%d cached], write %d [%d from hws]\n",
+			seq_printf(m, "\tExeclist CSB read %d [%d cached], write %d [%d from hws], interrupt posted? %s\n",
 				   read, engine->csb_head,
 				   write,
-				   intel_read_status_page(engine, intel_hws_csb_write_index(engine->i915)));
+				   intel_read_status_page(engine, intel_hws_csb_write_index(engine->i915)),
+				   yesno(test_bit(ENGINE_IRQ_EXECLIST,
+						  &engine->irq_posted)));
 			if (read >= GEN8_CSB_ENTRIES)
 				read = 0;
 			if (write >= GEN8_CSB_ENTRIES)
-- 
2.13.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 02/15] drm/i915: Reset context image on engines after triggering the reset
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  2017-07-17  9:11 ` [PATCH 03/15] drm/i915: Serialize per-engine resets against new requests Chris Wilson
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

We want to fix up the context image only after triggering the reset, to
ensure that there are no more pending writes from the hw that may
conflict, and so that any writes in flight at the time of the reset are
also caught by the fixup.

Fixes: a1ef70e14453 ("drm/i915: Add support for per engine reset recovery")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index acd0c8b86f9d..901c3ff61527 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1954,6 +1954,8 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 		goto out;
 	}
 
+	ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
+
 	/*
 	 * The request that caused the hang is stuck on elsp, we know the
 	 * active request and can drop it, adjust head to skip the offending
@@ -1961,9 +1963,6 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	 */
 	i915_gem_reset_engine(engine, active_request);
 
-	/* Finally, reset just this engine. */
-	ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
-
 	i915_gem_reset_finish_engine(engine);
 
 	if (ret) {
-- 
2.13.2

* [PATCH 03/15] drm/i915: Serialize per-engine resets against new requests
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
  2017-07-17  9:11 ` [PATCH 02/15] drm/i915: Reset context image on engines after triggering the reset Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  2017-07-17 21:58   ` Michel Thierry
  2017-07-17  9:11 ` [PATCH 04/15] drm/i915: Flush the execlist ports if idle Chris Wilson
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

We rely on disabling the execlists (by stopping the tasklet) to prevent
new requests from submitting to the engine ELSP before we are ready.
However, we re-enable the engine before we call init_hw, which gives
userspace the opportunity to submit a new request that is then
overwritten by init_hw -- but not before the HW may have started
executing. The subsequent out-of-order CSB is detected by our sanity
checks in intel_lrc_irq_handler().
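The control flow this patch moves to can be sketched as a single-exit reset routine in which re-enabling submission is always the last step, on success and on every error path alike. This is a minimal userspace model with illustrative names (prepare/finish stand in for i915_gem_reset_engine's prepare/finish helpers), not the driver's real code:

```c
#include <errno.h>

/* Toy engine state; the counters only exist so ordering is checkable. */
struct engine {
	int prepared;
	int reset_ok;	/* input: does the hw reset succeed? */
	int inited;
	int finished;
};

static void prepare(struct engine *e) { e->prepared = 1; }
static int  hw_reset(struct engine *e) { return e->reset_ok ? 0 : -EIO; }
static void fixup_context_image(struct engine *e) { (void)e; }
static int  init_hw(struct engine *e) { e->inited = 1; return 0; }
static void finish(struct engine *e) { e->finished = 1; }

static int reset_engine(struct engine *e)
{
	int ret;

	prepare(e);		/* stop the tasklet: no new ELSP submissions */

	ret = hw_reset(e);
	if (ret)
		goto out;	/* expect fallback to a full device reset */

	fixup_context_image(e);
	ret = init_hw(e);
out:
	finish(e);		/* re-enable submission only at the very end */
	return ret;
}
```

With submission re-enabled only at the `out:` label, userspace can no longer slip a request in between the hw reset and init_hw.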

Fixes: a1ef70e14453 ("drm/i915: Add support for per engine reset recovery")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 901c3ff61527..bc121a46ed9a 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1955,6 +1955,12 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	}
 
 	ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
+	if (ret) {
+		/* If we fail here, we expect to fallback to a global reset */
+		DRM_DEBUG_DRIVER("Failed to reset %s, ret=%d\n",
+				 engine->name, ret);
+		goto out;
+	}
 
 	/*
 	 * The request that caused the hang is stuck on elsp, we know the
@@ -1963,15 +1969,6 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	 */
 	i915_gem_reset_engine(engine, active_request);
 
-	i915_gem_reset_finish_engine(engine);
-
-	if (ret) {
-		/* If we fail here, we expect to fallback to a global reset */
-		DRM_DEBUG_DRIVER("Failed to reset %s, ret=%d\n",
-				 engine->name, ret);
-		goto out;
-	}
-
 	/*
 	 * The engine and its registers (and workarounds in case of render)
 	 * have been reset to their default values. Follow the init_ring
@@ -1983,6 +1980,7 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 
 	error->reset_engine_count[engine->id]++;
 out:
+	i915_gem_reset_finish_engine(engine);
 	return ret;
 }
 
-- 
2.13.2

* [PATCH 04/15] drm/i915: Flush the execlist ports if idle
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
  2017-07-17  9:11 ` [PATCH 02/15] drm/i915: Reset context image on engines after triggering the reset Chris Wilson
  2017-07-17  9:11 ` [PATCH 03/15] drm/i915: Serialize per-engine resets against new requests Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  2017-07-20 12:18   ` Mika Kuoppala
  2017-07-20 13:12   ` Mika Kuoppala
  2017-07-17  9:11 ` [PATCH 05/15] drm/i915: Check execlist/ring status during hangcheck Chris Wilson
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

When doing a GPU reset, the CSB register will be trashed and we will
lose any context-switch notifications that happened since the tasklet
was disabled. If we find that all requests on this engine were
completed, we want to make sure that the ELSP tracker is similarly empty
so that we do not feed back in the completed requests upon recovering
from the reset.
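The port bookkeeping described above can be modelled in a few lines. This is a toy sketch of the two-entry ELSP tracker, with an `int` context id standing in for the request's context pointer and `active` for a non-NULL port entry; the names and types are illustrative, not the driver's:

```c
#include <stddef.h>
#include <string.h>

struct port { int ctx; int active; };

/* incomplete_ctx is the context of the oldest incomplete request, or
 * NULL if every request on the engine completed before the reset. */
static void flush_ports(struct port port[2], const int *incomplete_ctx)
{
	if (!incomplete_ctx) {
		/* All requests completed: empty the tracker so the
		 * finished requests are not fed back in on recovery. */
		memset(port, 0, 2 * sizeof(*port));
		return;
	}

	/* If the head of the tracker holds a different context, that
	 * entry already switched out and its CSB event was lost:
	 * retire it and promote the second port. */
	if (port[0].active && port[0].ctx != *incomplete_ctx) {
		port[0] = port[1];
		memset(&port[1], 0, sizeof(port[1]));
	}
}
```

In the real patch the retired entry also drops its request reference via i915_gem_request_put(), which the toy model omits.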

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 3c83f2dd6798..ad61d1998fb7 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1327,6 +1327,31 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 {
 	struct execlist_port *port = engine->execlist_port;
 	struct intel_context *ce;
+	unsigned int n;
+
+	/*
+	 * Catch up with any missed context-switch interrupts.
+	 *
+	 * Ideally we would just read the remaining CSB entries now that we
+	 * know the gpu is idle. However, the CSB registers are sometimes^W
+	 * often trashed across a GPU reset! Instead we have to rely on
+	 * guessing the missed context-switch events by looking at what
+	 * requests were completed.
+	 */
+	if (!request) {
+		for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++)
+			i915_gem_request_put(port_request(&port[n]));
+		memset(engine->execlist_port, 0, sizeof(engine->execlist_port));
+		return;
+	}
+
+	if (request->ctx != port_request(port)->ctx) {
+		i915_gem_request_put(port_request(port));
+		port[0] = port[1];
+		memset(&port[1], 0, sizeof(port[1]));
+	}
+
+	GEM_BUG_ON(request->ctx != port_request(port)->ctx);
 
 	/* If the request was innocent, we leave the request in the ELSP
 	 * and will try to replay it on restarting. The context image may
@@ -1338,7 +1363,7 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 	 * and have to at least restore the RING register in the context
 	 * image back to the expected values to skip over the guilty request.
 	 */
-	if (!request || request->fence.error != -EIO)
+	if (request->fence.error != -EIO)
 		return;
 
 	/* We want a simple context + ring to execute the breadcrumb update.
@@ -1360,15 +1385,6 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 	request->ring->head = request->postfix;
 	intel_ring_update_space(request->ring);
 
-	/* Catch up with any missed context-switch interrupts */
-	if (request->ctx != port_request(port)->ctx) {
-		i915_gem_request_put(port_request(port));
-		port[0] = port[1];
-		memset(&port[1], 0, sizeof(port[1]));
-	}
-
-	GEM_BUG_ON(request->ctx != port_request(port)->ctx);
-
 	/* Reset WaIdleLiteRestore:bdw,skl as well */
 	request->tail =
 		intel_ring_wrap(request->ring,
-- 
2.13.2

* [PATCH 05/15] drm/i915: Check execlist/ring status during hangcheck
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
                   ` (2 preceding siblings ...)
  2017-07-17  9:11 ` [PATCH 04/15] drm/i915: Flush the execlist ports if idle Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  2017-07-17 23:11   ` Michel Thierry
  2017-07-17  9:11 ` [PATCH 06/15] drm/i915: Check the execlist queue for pending requests before declaring idle Chris Wilson
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

Before we declare an engine as idle, check if there are any pending
execlist context-switches and if the ring itself reports as idle.
Otherwise, we may be left in a situation where we miss a crucial
execlist event (or something more sinister) yet the requests complete;
since the seqno write happens, we would believe the engine to be truly idle.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_hangcheck.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index 9b0ece427bdc..d9d87d96fb69 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -324,7 +324,7 @@ hangcheck_get_action(struct intel_engine_cs *engine,
 	if (engine->hangcheck.seqno != hc->seqno)
 		return ENGINE_ACTIVE_SEQNO;
 
-	if (i915_seqno_passed(hc->seqno, intel_engine_last_submit(engine)))
+	if (intel_engine_is_idle(engine))
 		return ENGINE_IDLE;
 
 	return engine_stuck(engine, hc->acthd);
-- 
2.13.2

* [PATCH 06/15] drm/i915: Check the execlist queue for pending requests before declaring idle
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
                   ` (3 preceding siblings ...)
  2017-07-17  9:11 ` [PATCH 05/15] drm/i915: Check execlist/ring status during hangcheck Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  2017-07-20 12:23   ` Mika Kuoppala
  2017-07-17  9:11 ` [PATCH 07/15] drm/i915: Move idle checks before intel_engine_init_global_seqno() Chris Wilson
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

Include a check against the execlist queue before declaring the engine
idle and passing hangcheck.
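The strengthened idle test can be sketched as three ordered checks. This is a userspace model with hypothetical fields standing in for the ELSP port, the ready-request queue (execlist_first) and the ring registers:

```c
#include <stdbool.h>

/* Toy engine state; field names mirror the driver's, types simplified. */
struct engine {
	const void *elsp_port0;		/* in-flight request, if any */
	const void *execlist_first;	/* first ready-but-unsubmitted request */
	bool ring_idle;			/* RING_HEAD caught up with RING_TAIL */
};

/* Idle only if nothing is in flight, nothing is queued, and the ring
 * itself has stopped -- this patch adds the middle check. */
static bool engine_is_idle(const struct engine *e)
{
	if (e->elsp_port0)
		return false;
	if (e->execlist_first)	/* ELSP empty, but requests are ready */
		return false;
	return e->ring_idle;
}
```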

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_engine_cs.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index cba120f3d89d..fbac94557ffa 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -1407,6 +1407,10 @@ bool intel_engine_is_idle(struct intel_engine_cs *engine)
 	if (port_request(&engine->execlist_port[0]))
 		return false;
 
+	/* ELSP is empty, but there are ready requests? */
+	if (READ_ONCE(engine->execlist_first))
+		return false;
+
 	/* Ring stopped? */
 	if (!ring_is_idle(engine))
 		return false;
-- 
2.13.2

* [PATCH 07/15] drm/i915: Move idle checks before intel_engine_init_global_seqno()
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
                   ` (4 preceding siblings ...)
  2017-07-17  9:11 ` [PATCH 06/15] drm/i915: Check the execlist queue for pending requests before declaring idle Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  2017-07-20 13:16   ` Mika Kuoppala
  2017-07-17  9:11 ` [PATCH 08/15] drm/i915: Clear execlist port[] before updating seqno on wedging Chris Wilson
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

intel_engine_init_global_seqno() may be called from an uncontrolled
set-wedged path where we have given up waiting for broken hw and declare
it defunct. Along that path, any sanity checks that the hw is idle
before we adjust its state are expected to fail, so we simply cannot
perform them there. Instead of asserting inside init_global_seqno, we
move the checks to the normal caller, reset_all_global_seqno(), as it
handles runtime seqno wraparound.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem_request.c | 4 ++++
 drivers/gpu/drm/i915/intel_engine_cs.c  | 3 ---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 483af8921060..d93a185c0f0a 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -213,6 +213,10 @@ static int reset_all_global_seqno(struct drm_i915_private *i915, u32 seqno)
 				cond_resched();
 		}
 
+		/* Check we are idle before we fiddle with hw state! */
+		GEM_BUG_ON(!intel_engine_is_idle(engine));
+		GEM_BUG_ON(i915_gem_active_isset(&engine->timeline->last_request));
+
 		/* Finally reset hw state */
 		intel_engine_init_global_seqno(engine, seqno);
 		tl->seqno = seqno;
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index fbac94557ffa..19297e6aab41 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -337,9 +337,6 @@ void intel_engine_init_global_seqno(struct intel_engine_cs *engine, u32 seqno)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 
-	GEM_BUG_ON(!intel_engine_is_idle(engine));
-	GEM_BUG_ON(i915_gem_active_isset(&engine->timeline->last_request));
-
 	/* Our semaphore implementation is strictly monotonic (i.e. we proceed
 	 * so long as the semaphore value in the register/page is greater
 	 * than the sync value), so whenever we reset the seqno,
-- 
2.13.2

* [PATCH 08/15] drm/i915: Clear execlist port[] before updating seqno on wedging
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
                   ` (5 preceding siblings ...)
  2017-07-17  9:11 ` [PATCH 07/15] drm/i915: Move idle checks before intel_engine_init_global_seqno() Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  2017-07-20 13:31   ` Mika Kuoppala
  2017-07-17  9:11 ` [PATCH 09/15] drm/i915: Wake up waiters after setting the WEDGED bit Chris Wilson
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

When we wedge the device, we clear out the in-flight requests and
advance the breadcrumb to indicate they are complete. However, the
breadcrumb advance includes an assert that the engine is idle, so that
advancement needs to be the last step to ensure we pass our own sanity
checks.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 580b5042f4f7..40e94b4ef532 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3114,13 +3114,6 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
 			dma_fence_set_error(&request->fence, -EIO);
 	spin_unlock_irqrestore(&engine->timeline->lock, flags);
 
-	/* Mark all pending requests as complete so that any concurrent
-	 * (lockless) lookup doesn't try and wait upon the request as we
-	 * reset it.
-	 */
-	intel_engine_init_global_seqno(engine,
-				       intel_engine_last_submit(engine));
-
 	/*
 	 * Clear the execlists queue up before freeing the requests, as those
 	 * are the ones that keep the context and ringbuffer backing objects
@@ -3149,6 +3142,13 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
 		 */
 		clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
 	}
+
+	/* Mark all pending requests as complete so that any concurrent
+	 * (lockless) lookup doesn't try and wait upon the request as we
+	 * reset it.
+	 */
+	intel_engine_init_global_seqno(engine,
+				       intel_engine_last_submit(engine));
 }
 
 static int __i915_gem_set_wedged_BKL(void *data)
-- 
2.13.2

* [PATCH 09/15] drm/i915: Wake up waiters after setting the WEDGED bit
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
                   ` (6 preceding siblings ...)
  2017-07-17  9:11 ` [PATCH 08/15] drm/i915: Clear execlist port[] before updating seqno on wedging Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  2017-07-20 13:48   ` Mika Kuoppala
  2017-07-17  9:11 ` [PATCH 10/15] drm/i915: Assert that machine is wedged for nop_submit_request Chris Wilson
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

After setting the WEDGED bit, make sure that we do wake up waiters as
they may not be waiting for a request completion yet, just for its
execution.
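The ordering fixed here is the classic publish-then-wake pattern: the terminal state must be visible before the broadcast, or a woken waiter can re-check its predicate, see no progress, and sleep forever. A minimal userspace sketch using POSIX condition variables (the kernel's wake_up_all on gpu_error.reset_queue plays the broadcast role; names here are illustrative):

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t reset_queue = PTHREAD_COND_INITIALIZER;
static bool wedged;

static void set_wedged_and_wake(void)
{
	pthread_mutex_lock(&lock);
	wedged = true;				/* set the WEDGED bit first... */
	pthread_cond_broadcast(&reset_queue);	/* ...then wake the waiters */
	pthread_mutex_unlock(&lock);
}

static void wait_for_wedged(void)
{
	pthread_mutex_lock(&lock);
	while (!wedged)		/* predicate re-checked on every wakeup */
		pthread_cond_wait(&reset_queue, &lock);
	pthread_mutex_unlock(&lock);
}
```

If the two lines in set_wedged_and_wake() were swapped and run without the common lock, a waiter could observe the wakeup yet still see `wedged == false`, which is exactly the window the patch closes.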

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 40e94b4ef532..ca7a56ff3904 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3157,10 +3157,12 @@ static int __i915_gem_set_wedged_BKL(void *data)
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 
-	set_bit(I915_WEDGED, &i915->gpu_error.flags);
 	for_each_engine(engine, i915, id)
 		engine_set_wedged(engine);
 
+	set_bit(I915_WEDGED, &i915->gpu_error.flags);
+	wake_up_all(&i915->gpu_error.reset_queue);
+
 	return 0;
 }
 
-- 
2.13.2

* [PATCH 10/15] drm/i915: Assert that machine is wedged for nop_submit_request
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
                   ` (7 preceding siblings ...)
  2017-07-17  9:11 ` [PATCH 09/15] drm/i915: Wake up waiters after setting the WEDGED bit Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  2017-07-18  0:15   ` Michel Thierry
  2017-07-17  9:11 ` [PATCH 11/15] drm/i915: Clear engine irq posted following a reset Chris Wilson
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

We should only ever do nop_submit_request when the machine is wedged, so
assert it is so.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index ca7a56ff3904..5517373c1bea 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3089,6 +3089,7 @@ void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
 
 static void nop_submit_request(struct drm_i915_gem_request *request)
 {
+	GEM_BUG_ON(!i915_terminally_wedged(&request->i915->gpu_error));
 	dma_fence_set_error(&request->fence, -EIO);
 	i915_gem_request_submit(request);
 	intel_engine_init_global_seqno(request->engine, request->global_seqno);
-- 
2.13.2

* [PATCH 11/15] drm/i915: Clear engine irq posted following a reset
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
                   ` (8 preceding siblings ...)
  2017-07-17  9:11 ` [PATCH 10/15] drm/i915: Assert that machine is wedged for nop_submit_request Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  2017-07-17 22:05   ` Michel Thierry
  2017-07-17  9:11 ` [PATCH 12/15] drm/i915: Make i915_gem_context_mark_guilty() safe for unlocked updates Chris Wilson
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

When the GPU is reset, we want to discard all pending notifications as
either we have manually completed them, or they are no longer
applicable. Make sure we do reset the engine->irq_posted prior to
re-enabling the engine (e.g. the interrupt tasklets) in
i915_gem_reset_finish_engine().

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 5517373c1bea..19511020f06e 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3027,6 +3027,8 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 void i915_gem_reset_engine(struct intel_engine_cs *engine,
 			   struct drm_i915_gem_request *request)
 {
+	engine->irq_posted = 0;
+
 	if (request && i915_gem_reset_request(request)) {
 		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
 				 engine->name, request->global_seqno);
-- 
2.13.2

* [PATCH 12/15] drm/i915: Make i915_gem_context_mark_guilty() safe for unlocked updates
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
                   ` (9 preceding siblings ...)
  2017-07-17  9:11 ` [PATCH 11/15] drm/i915: Clear engine irq posted following a reset Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  2017-07-17  9:11 ` [PATCH 13/15] drm/i915: Emit a user level message when resetting the GPU (or engine) Chris Wilson
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

Since we may call i915_gem_context_mark_guilty() concurrently when
resetting different engines in parallel, we need to make sure that our
updates are safe for such unlocked access.
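The core of the conversion is making the ban decision on the value returned by the atomic add, rather than re-reading a field another engine may be updating at the same time. A userspace sketch of that scheme, using C11 atomics in place of the kernel's atomic_t and keeping only the fields relevant to ban scoring:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define CONTEXT_SCORE_GUILTY		10
#define CONTEXT_SCORE_BAN_THRESHOLD	40

/* Userspace stand-in for i915_gem_context. */
struct ctx {
	atomic_uint guilty_count;
	atomic_uint ban_score;
	bool bannable;
};

/* Returns true when this hang pushes the context over the threshold.
 * Each concurrent caller decides on its own returned snapshot of the
 * score, so no decision is based on a torn or re-read value. */
static bool mark_guilty(struct ctx *ctx)
{
	unsigned int score;

	atomic_fetch_add(&ctx->guilty_count, 1);
	score = atomic_fetch_add(&ctx->ban_score, CONTEXT_SCORE_GUILTY)
		+ CONTEXT_SCORE_GUILTY;
	return ctx->bannable && score >= CONTEXT_SCORE_BAN_THRESHOLD;
}
```

With the constants above, the fourth guilty verdict (score 40) is the first to trip the threshold.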

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  2 +-
 drivers/gpu/drm/i915/i915_gem.c         | 32 ++++++++++++++++++--------------
 drivers/gpu/drm/i915/i915_gem_context.c |  6 +++---
 drivers/gpu/drm/i915/i915_gem_context.h |  6 +++---
 drivers/gpu/drm/i915/i915_gem_request.c |  3 +--
 drivers/gpu/drm/i915/i915_gpu_error.c   |  8 ++++----
 6 files changed, 30 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index b04347035a37..98705432ca59 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -602,7 +602,7 @@ struct drm_i915_file_private {
  * to limit the badly behaving clients access to gpu.
  */
 #define I915_MAX_CLIENT_CONTEXT_BANS 3
-	int context_bans;
+	atomic_t context_bans;
 };
 
 /* Used by dp and fdi links */
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 19511020f06e..82569f2a34ed 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2803,34 +2803,38 @@ i915_gem_object_pwrite_gtt(struct drm_i915_gem_object *obj,
 	return 0;
 }
 
-static bool ban_context(const struct i915_gem_context *ctx)
+static bool ban_context(const struct i915_gem_context *ctx,
+			unsigned int score)
 {
 	return (i915_gem_context_is_bannable(ctx) &&
-		ctx->ban_score >= CONTEXT_SCORE_BAN_THRESHOLD);
+		score >= CONTEXT_SCORE_BAN_THRESHOLD);
 }
 
 static void i915_gem_context_mark_guilty(struct i915_gem_context *ctx)
 {
-	ctx->guilty_count++;
-	ctx->ban_score += CONTEXT_SCORE_GUILTY;
-	if (ban_context(ctx))
-		i915_gem_context_set_banned(ctx);
+	unsigned int score;
+	bool banned;
 
-	DRM_DEBUG_DRIVER("context %s marked guilty (score %d) banned? %s\n",
-			 ctx->name, ctx->ban_score,
-			 yesno(i915_gem_context_is_banned(ctx)));
+	atomic_inc(&ctx->guilty_count);
 
-	if (!i915_gem_context_is_banned(ctx) || IS_ERR_OR_NULL(ctx->file_priv))
+	score = atomic_add_return(CONTEXT_SCORE_GUILTY, &ctx->ban_score);
+	banned = ban_context(ctx, score);
+	DRM_DEBUG_DRIVER("context %s marked guilty (score %d) banned? %s\n",
+			 ctx->name, score, yesno(banned));
+	if (!banned)
 		return;
 
-	ctx->file_priv->context_bans++;
-	DRM_DEBUG_DRIVER("client %s has had %d context banned\n",
-			 ctx->name, ctx->file_priv->context_bans);
+	i915_gem_context_set_banned(ctx);
+	if (!IS_ERR_OR_NULL(ctx->file_priv)) {
+		atomic_inc(&ctx->file_priv->context_bans);
+		DRM_DEBUG_DRIVER("client %s has had %d context banned\n",
+				 ctx->name, atomic_read(&ctx->file_priv->context_bans));
+	}
 }
 
 static void i915_gem_context_mark_innocent(struct i915_gem_context *ctx)
 {
-	ctx->active_count++;
+	atomic_inc(&ctx->active_count);
 }
 
 struct drm_i915_gem_request *
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 1a87d04e7937..ed91ac8ca832 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -977,7 +977,7 @@ int i915_gem_switch_to_kernel_context(struct drm_i915_private *dev_priv)
 
 static bool client_is_banned(struct drm_i915_file_private *file_priv)
 {
-	return file_priv->context_bans > I915_MAX_CLIENT_CONTEXT_BANS;
+	return atomic_read(&file_priv->context_bans) > I915_MAX_CLIENT_CONTEXT_BANS;
 }
 
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
@@ -1179,8 +1179,8 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	else
 		args->reset_count = 0;
 
-	args->batch_active = READ_ONCE(ctx->guilty_count);
-	args->batch_pending = READ_ONCE(ctx->active_count);
+	args->batch_active = atomic_read(&ctx->guilty_count);
+	args->batch_pending = atomic_read(&ctx->active_count);
 
 	ret = 0;
 out:
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 04320f80f9f4..2d02918a449e 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -191,17 +191,17 @@ struct i915_gem_context {
 	u32 desc_template;
 
 	/** guilty_count: How many times this context has caused a GPU hang. */
-	unsigned int guilty_count;
+	atomic_t guilty_count;
 	/**
 	 * @active_count: How many times this context was active during a GPU
 	 * hang, but did not cause it.
 	 */
-	unsigned int active_count;
+	atomic_t active_count;
 
 #define CONTEXT_SCORE_GUILTY		10
 #define CONTEXT_SCORE_BAN_THRESHOLD	40
 	/** ban_score: Accumulated score of all hangs caused by this context. */
-	int ban_score;
+	atomic_t ban_score;
 
 	/** remap_slice: Bitmask of cache lines that need remapping */
 	u8 remap_slice;
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index d93a185c0f0a..68e406a53c04 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -374,8 +374,7 @@ static void i915_gem_request_retire(struct drm_i915_gem_request *request)
 	i915_gem_request_remove_from_client(request);
 
 	/* Retirement decays the ban score as it is a sign of ctx progress */
-	if (request->ctx->ban_score > 0)
-		request->ctx->ban_score--;
+	atomic_dec_if_positive(&request->ctx->ban_score);
 
 	/* The backing object for the context is done after switching to the
 	 * *next* context. Therefore we cannot retire the previous context until
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index ae70283470a6..ed5a1eb839ad 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1266,7 +1266,7 @@ static void record_request(struct drm_i915_gem_request *request,
 			   struct drm_i915_error_request *erq)
 {
 	erq->context = request->ctx->hw_id;
-	erq->ban_score = request->ctx->ban_score;
+	erq->ban_score = atomic_read(&request->ctx->ban_score);
 	erq->seqno = request->global_seqno;
 	erq->jiffies = request->emitted_jiffies;
 	erq->head = request->head;
@@ -1357,9 +1357,9 @@ static void record_context(struct drm_i915_error_context *e,
 
 	e->handle = ctx->user_handle;
 	e->hw_id = ctx->hw_id;
-	e->ban_score = ctx->ban_score;
-	e->guilty = ctx->guilty_count;
-	e->active = ctx->active_count;
+	e->ban_score = atomic_read(&ctx->ban_score);
+	e->guilty = atomic_read(&ctx->guilty_count);
+	e->active = atomic_read(&ctx->active_count);
 }
 
 static void request_record_user_bo(struct drm_i915_gem_request *request,
-- 
2.13.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 13/15] drm/i915: Emit a user level message when resetting the GPU (or engine)
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
                   ` (10 preceding siblings ...)
  2017-07-17  9:11 ` [PATCH 12/15] drm/i915: Make i915_gem_context_mark_guilty() safe for unlocked updates Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  2017-07-18  0:22   ` Michel Thierry
  2017-07-17  9:11 ` [PATCH 14/15] drm/i915: Disable per-engine reset for Broxton Chris Wilson
  2017-07-17  9:11 ` [PATCH 15/15] drm/i915/selftests: Exercise independence of per-engine resets Chris Wilson
  13 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

Although a banned context will be told to -EIO off if it tries to
submit more requests, we have a discrepancy between whole-device resets
and per-engine resets: we report the GPU reset but not the engine
resets. This leaves a bit of mystery as to why the context was banned,
and also reduces overall awareness of when a GPU (engine) reset occurs,
with its possible side-effects.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index bc121a46ed9a..4b62fd012877 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1865,9 +1865,10 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	if (!i915_gem_unset_wedged(dev_priv))
 		goto wakeup;
 
+	dev_notice(dev_priv->drm.dev,
+		   "Resetting chip after gpu hang\n");
 	error->reset_count++;
 
-	pr_notice("drm/i915: Resetting chip after gpu hang\n");
 	disable_irq(dev_priv->drm.irq);
 	ret = i915_gem_reset_prepare(dev_priv);
 	if (ret) {
@@ -1945,7 +1946,9 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 
 	GEM_BUG_ON(!test_bit(I915_RESET_ENGINE + engine->id, &error->flags));
 
-	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
+	dev_notice(engine->i915->drm.dev,
+		   "Resetting %s after gpu hang\n", engine->name);
+	error->reset_engine_count[engine->id]++;
 
 	active_request = i915_gem_reset_prepare_engine(engine);
 	if (IS_ERR(active_request)) {
@@ -1978,7 +1981,6 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	if (ret)
 		goto out;
 
-	error->reset_engine_count[engine->id]++;
 out:
 	i915_gem_reset_finish_engine(engine);
 	return ret;
-- 
2.13.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 14/15] drm/i915: Disable per-engine reset for Broxton
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
                   ` (11 preceding siblings ...)
  2017-07-17  9:11 ` [PATCH 13/15] drm/i915: Emit a user level message when resetting the GPU (or engine) Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  2017-07-17 23:57   ` Michel Thierry
  2017-07-17  9:11 ` [PATCH 15/15] drm/i915/selftests: Exercise independence of per-engine resets Chris Wilson
  13 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

Triggering a GPU reset for one engine affects the others, notably
corrupting the context status buffer (CSB), effectively losing track of
in-flight requests.

Adding a few printks:
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index ad41836fa5e5..a969456bc0fa 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1953,6 +1953,7 @@ int i915_reset_engine(struct intel_engine_cs *engine)
                goto out;
        }

+       pr_err("Resetting %s\n", engine->name);
        ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
        if (ret) {
                /* If we fail here, we expect to fallback to a global reset */
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 716e5c9ea222..a72bc35d0870 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -355,6 +355,7 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
                                execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
                        port_set(&port[n], port_pack(rq, count));
                        desc = execlists_update_context(rq);
+                       pr_err("%s: in (rq=%x) ctx=%d\n", engine->name, rq->global_seqno, upper_32_bits(desc));
                        GEM_DEBUG_EXEC(port[n].context_id = upper_32_bits(desc));
                } else {
                        GEM_BUG_ON(!n);
@@ -594,9 +595,23 @@ static void intel_lrc_irq_handler(unsigned long data)
                        if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK))
                                continue;

+                       pr_err("%s: out CSB (%x head=%d, tail=%d), ctx=%d, rq=%d\n",
+                                       engine->name,
+                                       readl(csb_mmio),
+                                       head, tail,
+                                       readl(buf+2*head+1),
+                                       port->context_id);
+
                        /* Check the context/desc id for this event matches */
-                       GEM_DEBUG_BUG_ON(readl(buf + 2 * head + 1) !=
-                                        port->context_id);
+                       if (readl(buf + 2 * head + 1) != port->context_id) {
+                               pr_err("%s: BUG CSB (%x head=%d, tail=%d), ctx=%d, rq=%d\n",
+                                               engine->name,
+                                               readl(csb_mmio),
+                                               head, tail,
+                                               readl(buf+2*head+1),
+                                               port->context_id);
+                               BUG();
+                       }

                        rq = port_unpack(port, &count);
                        GEM_BUG_ON(count == 0);

Results in:

[ 6423.006602] Resetting rcs0
[ 6423.009080] rcs0: in (rq=fffffe70) ctx=1
[ 6423.009216] rcs0: in (rq=fffffe6f) ctx=3
[ 6423.009542] rcs0: out CSB (2 head=1, tail=2), ctx=3, rq=3
[ 6423.009619] Resetting bcs0
[ 6423.009980] rcs0: BUG CSB (0 head=1, tail=2), ctx=0, rq=3

Note that this bug may affect all machines and not just Broxton;
Broxton is just the first machine on which I have confirmed it.

Fixes: 142bc7d99bcf ("drm/i915: Modify error handler for per engine hang recovery")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index a1e6b696bcfa..09d97e0990b7 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -398,6 +398,7 @@ static const struct intel_device_info intel_broxton_info = {
 	GEN9_LP_FEATURES,
 	.platform = INTEL_BROXTON,
 	.ddb_size = 512,
+	.has_reset_engine = false,
 };
 
 static const struct intel_device_info intel_geminilake_info = {
-- 
2.13.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 15/15] drm/i915/selftests: Exercise independence of per-engine resets
  2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
                   ` (12 preceding siblings ...)
  2017-07-17  9:11 ` [PATCH 14/15] drm/i915: Disable per-engine reset for Broxton Chris Wilson
@ 2017-07-17  9:11 ` Chris Wilson
  13 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2017-07-17  9:11 UTC (permalink / raw)
  To: intel-gfx

If all goes well, resetting one engine should not affect the operation
of any other. To test this, we set up a continuous stream of requests
to each of the "innocent" engines whilst constantly resetting our
target engine.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 165 +++++++++++++++++++++++
 drivers/gpu/drm/i915/selftests/mock_context.c    |   8 ++
 drivers/gpu/drm/i915/selftests/mock_context.h    |   3 +
 3 files changed, 176 insertions(+)

diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index 7096c3911cd3..dbfcb31ba9f4 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -22,8 +22,13 @@
  *
  */
 
+#include <linux/kthread.h>
+
 #include "../i915_selftest.h"
 
+#include "mock_context.h"
+#include "mock_drm.h"
+
 struct hang {
 	struct drm_i915_private *i915;
 	struct drm_i915_gem_object *hws;
@@ -372,6 +377,165 @@ static int igt_reset_engine(void *arg)
 	return err;
 }
 
+static int active_engine(void *data)
+{
+	struct intel_engine_cs *engine = data;
+	struct drm_i915_gem_request *rq[2] = {};
+	struct i915_gem_context *ctx[2];
+	struct drm_file *file;
+	unsigned long count = 0;
+	int err = 0;
+
+	file = mock_file(engine->i915);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	mutex_lock(&engine->i915->drm.struct_mutex);
+	ctx[0] = live_context(engine->i915, file);
+	mutex_unlock(&engine->i915->drm.struct_mutex);
+	if (IS_ERR(ctx[0])) {
+		err = PTR_ERR(ctx[0]);
+		goto err_file;
+	}
+
+	mutex_lock(&engine->i915->drm.struct_mutex);
+	ctx[1] = live_context(engine->i915, file);
+	mutex_unlock(&engine->i915->drm.struct_mutex);
+	if (IS_ERR(ctx[1])) {
+		err = PTR_ERR(ctx[1]);
+		i915_gem_context_put(ctx[0]);
+		goto err_file;
+	}
+
+	while (!kthread_should_stop()) {
+		unsigned int idx = count++ & 1;
+		struct drm_i915_gem_request *old = rq[idx];
+		struct drm_i915_gem_request *new;
+
+		mutex_lock(&engine->i915->drm.struct_mutex);
+		new = i915_gem_request_alloc(engine, ctx[idx]);
+		if (IS_ERR(new)) {
+			mutex_unlock(&engine->i915->drm.struct_mutex);
+			err = PTR_ERR(new);
+			break;
+		}
+
+		rq[idx] = i915_gem_request_get(new);
+		i915_add_request(new);
+		mutex_unlock(&engine->i915->drm.struct_mutex);
+
+		if (old) {
+			i915_wait_request(old, 0, MAX_SCHEDULE_TIMEOUT);
+			i915_gem_request_put(old);
+		}
+	}
+
+	for (count = 0; count < ARRAY_SIZE(rq); count++)
+		i915_gem_request_put(rq[count]);
+
+err_file:
+	mock_file_free(engine->i915, file);
+	return err;
+}
+
+static int igt_reset_active_engines(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct intel_engine_cs *engine, *active;
+	enum intel_engine_id id, tmp;
+	int err = 0;
+
+	/* Check that issuing a reset on one engine does not interfere
+	 * with any other engine.
+	 */
+
+	if (!intel_has_reset_engine(i915))
+		return 0;
+
+	for_each_engine(engine, i915, id) {
+		struct task_struct *threads[I915_NUM_ENGINES];
+		unsigned long resets[I915_NUM_ENGINES];
+		unsigned long global = i915_reset_count(&i915->gpu_error);
+		IGT_TIMEOUT(end_time);
+
+		memset(threads, 0, sizeof(threads));
+		for_each_engine(active, i915, tmp) {
+			struct task_struct *tsk;
+
+			if (active == engine)
+				continue;
+
+			resets[tmp] = i915_reset_engine_count(&i915->gpu_error,
+							      active);
+
+			tsk = kthread_run(active_engine, active,
+					  "igt/%s", active->name);
+			if (IS_ERR(tsk)) {
+				err = PTR_ERR(tsk);
+				goto unwind;
+			}
+
+			threads[tmp] = tsk;
+			get_task_struct(tsk);
+
+		}
+
+		set_bit(I915_RESET_ENGINE + engine->id, &i915->gpu_error.flags);
+		do {
+			err = i915_reset_engine(engine);
+			if (err) {
+				pr_err("i915_reset_engine(%s) failed, err=%d\n",
+				       engine->name, err);
+				break;
+			}
+		} while (time_before(jiffies, end_time));
+		clear_bit(I915_RESET_ENGINE + engine->id,
+			  &i915->gpu_error.flags);
+
+unwind:
+		for_each_engine(active, i915, tmp) {
+			int ret;
+
+			if (!threads[tmp])
+				continue;
+
+			ret = kthread_stop(threads[tmp]);
+			if (ret) {
+				pr_err("kthread for active engine %s failed, err=%d\n",
+				       active->name, ret);
+				if (!err)
+					err = ret;
+			}
+			put_task_struct(threads[tmp]);
+
+			if (resets[tmp] != i915_reset_engine_count(&i915->gpu_error,
+								   active)) {
+				pr_err("Innocent engine %s was reset (count=%ld)\n",
+				       active->name,
+				       i915_reset_engine_count(&i915->gpu_error,
+							       active) - resets[tmp]);
+				err = -EIO;
+			}
+		}
+
+		if (global != i915_reset_count(&i915->gpu_error)) {
+			pr_err("Global reset (count=%ld)!\n",
+			       i915_reset_count(&i915->gpu_error) - global);
+			err = -EIO;
+		}
+
+		if (err)
+			break;
+
+		cond_resched();
+	}
+
+	if (i915_terminally_wedged(&i915->gpu_error))
+		err = -EIO;
+
+	return err;
+}
+
 static u32 fake_hangcheck(struct drm_i915_gem_request *rq)
 {
 	u32 reset_count;
@@ -689,6 +853,7 @@ int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
 		SUBTEST(igt_hang_sanitycheck),
 		SUBTEST(igt_global_reset),
 		SUBTEST(igt_reset_engine),
+		SUBTEST(igt_reset_active_engines),
 		SUBTEST(igt_wait_reset),
 		SUBTEST(igt_reset_queue),
 		SUBTEST(igt_render_engine_reset_fallback),
diff --git a/drivers/gpu/drm/i915/selftests/mock_context.c b/drivers/gpu/drm/i915/selftests/mock_context.c
index 9c7c68181f82..d436f2d5089b 100644
--- a/drivers/gpu/drm/i915/selftests/mock_context.c
+++ b/drivers/gpu/drm/i915/selftests/mock_context.c
@@ -95,3 +95,11 @@ void mock_init_contexts(struct drm_i915_private *i915)
 	INIT_WORK(&i915->contexts.free_work, contexts_free_worker);
 	init_llist_head(&i915->contexts.free_list);
 }
+
+struct i915_gem_context *
+live_context(struct drm_i915_private *i915, struct drm_file *file)
+{
+	lockdep_assert_held(&i915->drm.struct_mutex);
+
+	return i915_gem_create_context(i915, file->driver_priv);
+}
diff --git a/drivers/gpu/drm/i915/selftests/mock_context.h b/drivers/gpu/drm/i915/selftests/mock_context.h
index 383941a61124..2f432c03d413 100644
--- a/drivers/gpu/drm/i915/selftests/mock_context.h
+++ b/drivers/gpu/drm/i915/selftests/mock_context.h
@@ -33,4 +33,7 @@ mock_context(struct drm_i915_private *i915,
 
 void mock_context_close(struct i915_gem_context *ctx);
 
+struct i915_gem_context *
+live_context(struct drm_i915_private *i915, struct drm_file *file);
+
 #endif /* !__MOCK_CONTEXT_H */
-- 
2.13.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 03/15] drm/i915: Serialize per-engine resets against new requests
  2017-07-17  9:11 ` [PATCH 03/15] drm/i915: Serialize per-engine resets against new requests Chris Wilson
@ 2017-07-17 21:58   ` Michel Thierry
  0 siblings, 0 replies; 33+ messages in thread
From: Michel Thierry @ 2017-07-17 21:58 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 17/07/17 02:11, Chris Wilson wrote:
> We rely on disabling the execlists (by stopping the tasklet) to prevent
> new requests from submitting to the engine ELSP before we are ready.
> However, we re-enable the engine before we call init_hw which gives
> userspace the opportunity to submit a new request which is then
> overwritten by init_hw -- but not before the HW may have started
> executing. The subsequent out-of-order CSB is detected by our sanity
> checks in intel_lrc_irq_handler().
> 
> Fixes: a1ef70e14453 ("drm/i915: Add support for per engine reset recovery")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.c | 16 +++++++---------
>   1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 901c3ff61527..bc121a46ed9a 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1955,6 +1955,12 @@ int i915_reset_engine(struct intel_engine_cs *engine)
>   	}
>   
>   	ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
> +	if (ret) {
> +		/* If we fail here, we expect to fallback to a global reset */
> +		DRM_DEBUG_DRIVER("Failed to reset %s, ret=%d\n",
> +				 engine->name, ret);
> +		goto out;
> +	}
>   
>   	/*
>   	 * The request that caused the hang is stuck on elsp, we know the
> @@ -1963,15 +1969,6 @@ int i915_reset_engine(struct intel_engine_cs *engine)
>   	 */
>   	i915_gem_reset_engine(engine, active_request);
>   
> -	i915_gem_reset_finish_engine(engine);
> -
> -	if (ret) {
> -		/* If we fail here, we expect to fallback to a global reset */
> -		DRM_DEBUG_DRIVER("Failed to reset %s, ret=%d\n",
> -				 engine->name, ret);
> -		goto out;
> -	}
> -
>   	/*
>   	 * The engine and its registers (and workarounds in case of render)
>   	 * have been reset to their default values. Follow the init_ring
> @@ -1983,6 +1980,7 @@ int i915_reset_engine(struct intel_engine_cs *engine)
>   
>   	error->reset_engine_count[engine->id]++;
>   out:
> +	i915_gem_reset_finish_engine(engine);
>   	return ret;
>   }
>   
> 

Reviewed-by: Michel Thierry <michel.thierry@intel.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 11/15] drm/i915: Clear engine irq posted following a reset
  2017-07-17  9:11 ` [PATCH 11/15] drm/i915: Clear engine irq posted following a reset Chris Wilson
@ 2017-07-17 22:05   ` Michel Thierry
  0 siblings, 0 replies; 33+ messages in thread
From: Michel Thierry @ 2017-07-17 22:05 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 17/07/17 02:11, Chris Wilson wrote:
> When the GPU is reset, we want to discard all pending notifications as
> either we have manually completed them, or they are no longer
> applicable. Make sure we do reset the engine->irq_posted prior to
> re-enabling the engine (e.g. the interrupt tasklets) in
> i915_gem_reset_finish_engine().
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/i915_gem.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 5517373c1bea..19511020f06e 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3027,6 +3027,8 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
>   void i915_gem_reset_engine(struct intel_engine_cs *engine,
>   			   struct drm_i915_gem_request *request)
>   {
> +	engine->irq_posted = 0;
> +
>   	if (request && i915_gem_reset_request(request)) {
>   		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
>   				 engine->name, request->global_seqno);
> 

Reviewed-by: Michel Thierry <michel.thierry@intel.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 05/15] drm/i915: Check execlist/ring status during hangcheck
  2017-07-17  9:11 ` [PATCH 05/15] drm/i915: Check execlist/ring status during hangcheck Chris Wilson
@ 2017-07-17 23:11   ` Michel Thierry
  0 siblings, 0 replies; 33+ messages in thread
From: Michel Thierry @ 2017-07-17 23:11 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 17/07/17 02:11, Chris Wilson wrote:
> Before we declare an engine as idle, check if there are any pending
> execlist context-switches and if the ring itself reports as idle.
> Otherwise, we may be left in a situation where we miss a crucial
> execlist event (or something more sinister) yet the requests complete.
> Since the seqno write happens, we believe the engine to be truly idle.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/intel_hangcheck.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index 9b0ece427bdc..d9d87d96fb69 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -324,7 +324,7 @@ hangcheck_get_action(struct intel_engine_cs *engine,
>   	if (engine->hangcheck.seqno != hc->seqno)
>   		return ENGINE_ACTIVE_SEQNO;
>   
> -	if (i915_seqno_passed(hc->seqno, intel_engine_last_submit(engine)))
> +	if (intel_engine_is_idle(engine))
>   		return ENGINE_IDLE;
>   
>   	return engine_stuck(engine, hc->acthd);
> 

Reviewed-by: Michel Thierry <michel.thierry@intel.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 14/15] drm/i915: Disable per-engine reset for Broxton
  2017-07-17  9:11 ` [PATCH 14/15] drm/i915: Disable per-engine reset for Broxton Chris Wilson
@ 2017-07-17 23:57   ` Michel Thierry
  0 siblings, 0 replies; 33+ messages in thread
From: Michel Thierry @ 2017-07-17 23:57 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 17/07/17 02:11, Chris Wilson wrote:
> Triggering a GPU reset for one engine affects the others, notably
> corrupting the context status buffer (CSB), effectively losing track of
> in-flight requests.
> 
> Adding a few printks:
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index ad41836fa5e5..a969456bc0fa 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1953,6 +1953,7 @@ int i915_reset_engine(struct intel_engine_cs *engine)
>                  goto out;
>          }
> 
> +       pr_err("Resetting %s\n", engine->name);
>          ret = intel_gpu_reset(engine->i915, intel_engine_flag(engine));
>          if (ret) {
>                  /* If we fail here, we expect to fallback to a global reset */
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 716e5c9ea222..a72bc35d0870 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -355,6 +355,7 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>                                  execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
>                          port_set(&port[n], port_pack(rq, count));
>                          desc = execlists_update_context(rq);
> +                       pr_err("%s: in (rq=%x) ctx=%d\n", engine->name, rq->global_seqno, upper_32_bits(desc));
>                          GEM_DEBUG_EXEC(port[n].context_id = upper_32_bits(desc));
>                  } else {
>                          GEM_BUG_ON(!n);
> @@ -594,9 +595,23 @@ static void intel_lrc_irq_handler(unsigned long data)
>                          if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK))
>                                  continue;
> 
> +                       pr_err("%s: out CSB (%x head=%d, tail=%d), ctx=%d, rq=%d\n",
> +                                       engine->name,
> +                                       readl(csb_mmio),
> +                                       head, tail,
> +                                       readl(buf+2*head+1),
> +                                       port->context_id);
> +
>                          /* Check the context/desc id for this event matches */
> -                       GEM_DEBUG_BUG_ON(readl(buf + 2 * head + 1) !=
> -                                        port->context_id);
> +                       if (readl(buf + 2 * head + 1) != port->context_id) {
> +                               pr_err("%s: BUG CSB (%x head=%d, tail=%d), ctx=%d, rq=%d\n",
> +                                               engine->name,
> +                                               readl(csb_mmio),
> +                                               head, tail,
> +                                               readl(buf+2*head+1),
> +                                               port->context_id);
> +                               BUG();
> +                       }
> 
>                          rq = port_unpack(port, &count);
>                          GEM_BUG_ON(count == 0);
> 
> Results in:
> 
> [ 6423.006602] Resetting rcs0
> [ 6423.009080] rcs0: in (rq=fffffe70) ctx=1
> [ 6423.009216] rcs0: in (rq=fffffe6f) ctx=3
> [ 6423.009542] rcs0: out CSB (2 head=1, tail=2), ctx=3, rq=3
> [ 6423.009619] Resetting bcs0
> [ 6423.009980] rcs0: BUG CSB (0 head=1, tail=2), ctx=0, rq=3
> 
> Note that this bug may affect all machines and not just Broxton;
> Broxton is just the first machine on which I have confirmed it.

Hopefully this is just Broxton being Broxton... I think I already sent 
this, but anyway...

Acked-by: Michel Thierry <michel.thierry@intel.com>

> 
> Fixes: 142bc7d99bcf ("drm/i915: Modify error handler for per engine hang recovery")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_pci.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> index a1e6b696bcfa..09d97e0990b7 100644
> --- a/drivers/gpu/drm/i915/i915_pci.c
> +++ b/drivers/gpu/drm/i915/i915_pci.c
> @@ -398,6 +398,7 @@ static const struct intel_device_info intel_broxton_info = {
>   	GEN9_LP_FEATURES,
>   	.platform = INTEL_BROXTON,
>   	.ddb_size = 512,
> +	.has_reset_engine = false,
>   };
>   
>   static const struct intel_device_info intel_geminilake_info = {
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 10/15] drm/i915: Assert that machine is wedged for nop_submit_request
  2017-07-17  9:11 ` [PATCH 10/15] drm/i915: Assert that machine is wedged for nop_submit_request Chris Wilson
@ 2017-07-18  0:15   ` Michel Thierry
  2017-07-20 12:51     ` Chris Wilson
  0 siblings, 1 reply; 33+ messages in thread
From: Michel Thierry @ 2017-07-18  0:15 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 17/07/17 02:11, Chris Wilson wrote:
> We should only ever do nop_submit_request when the machine is wedged, so
> assert it is so.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/i915_gem.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index ca7a56ff3904..5517373c1bea 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3089,6 +3089,7 @@ void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
>   
>   static void nop_submit_request(struct drm_i915_gem_request *request)
>   {
> +	GEM_BUG_ON(!i915_terminally_wedged(&request->i915->gpu_error));
>   	dma_fence_set_error(&request->fence, -EIO);
>   	i915_gem_request_submit(request);
>   	intel_engine_init_global_seqno(request->engine, request->global_seqno);
> 

Not sure about this; your previous patch ("[PATCH 09/15] drm/i915: Wake 
up waiters after setting the WEDGED bit") moves the set_bit after 
engine_set_wedged:

@@ -3157,10 +3157,12 @@ static int __i915_gem_set_wedged_BKL(void *data)
  	struct intel_engine_cs *engine;
  	enum intel_engine_id id;

-	set_bit(I915_WEDGED, &i915->gpu_error.flags);
  	for_each_engine(engine, i915, id)
  		engine_set_wedged(engine);

+	set_bit(I915_WEDGED, &i915->gpu_error.flags);
+	wake_up_all(&i915->gpu_error.reset_queue);
+
  	return 0;
  }

I don't think it will hit the BUG_ON, because i915_gem_set_wedged is 
already protected by stop_machine, but it looks odd anyway.

-Michel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 13/15] drm/i915: Emit a user level message when resetting the GPU (or engine)
  2017-07-17  9:11 ` [PATCH 13/15] drm/i915: Emit a user level message when resetting the GPU (or engine) Chris Wilson
@ 2017-07-18  0:22   ` Michel Thierry
  2017-07-20 12:52     ` Chris Wilson
  0 siblings, 1 reply; 33+ messages in thread
From: Michel Thierry @ 2017-07-18  0:22 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 17/07/17 02:11, Chris Wilson wrote:
> Although a banned context will be told to -EIO off if it tries to
> submit more requests, we have a discrepancy between whole-device resets
> and per-engine resets: we report the GPU reset but not the engine
> resets. This leaves a bit of mystery as to why the context was banned,
> and also reduces overall awareness of when a GPU (engine) reset occurs,
> with its possible side-effects.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.c | 8 +++++---
>   1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index bc121a46ed9a..4b62fd012877 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1865,9 +1865,10 @@ void i915_reset(struct drm_i915_private *dev_priv)
>   	if (!i915_gem_unset_wedged(dev_priv))
>   		goto wakeup;
>   
> +	dev_notice(dev_priv->drm.dev,
> +		   "Resetting chip after gpu hang\n");
>   	error->reset_count++;
>   
> -	pr_notice("drm/i915: Resetting chip after gpu hang\n");
>   	disable_irq(dev_priv->drm.irq);
>   	ret = i915_gem_reset_prepare(dev_priv);
>   	if (ret) {
> @@ -1945,7 +1946,9 @@ int i915_reset_engine(struct intel_engine_cs *engine)
>   
>   	GEM_BUG_ON(!test_bit(I915_RESET_ENGINE + engine->id, &error->flags));
>   
> -	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
> +	dev_notice(engine->i915->drm.dev,
> +		   "Resetting %s after gpu hang\n", engine->name);
> +	error->reset_engine_count[engine->id]++;
>   

This will increment both the engine reset count and the GPU reset count 
in the unlikely case that the engine reset gets promoted to a full reset.

Not a problem per-se, but I wanted to point it out (plus it makes both 
functions symmetric).

>   	active_request = i915_gem_reset_prepare_engine(engine);
>   	if (IS_ERR(active_request)) {
> @@ -1978,7 +1981,6 @@ int i915_reset_engine(struct intel_engine_cs *engine)
>   	if (ret)
>   		goto out;
>   
> -	error->reset_engine_count[engine->id]++;
>   out:
>   	i915_gem_reset_finish_engine(engine);
>   	return ret;
> 

Reviewed-by: Michel Thierry <michel.thierry@intel.com>

* Re: [PATCH 04/15] drm/i915: Flush the execlist ports if idle
  2017-07-17  9:11 ` [PATCH 04/15] drm/i915: Flush the execlist ports if idle Chris Wilson
@ 2017-07-20 12:18   ` Mika Kuoppala
  2017-07-20 12:28     ` Chris Wilson
  2017-07-20 13:12   ` Mika Kuoppala
  1 sibling, 1 reply; 33+ messages in thread
From: Mika Kuoppala @ 2017-07-20 12:18 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> When doing a GPU reset, the CSB register will be trashed and we will
> lose any context-switch notifications that happened since the tasklet
> was disabled. If we find that all requests on this engine were
> completed, we want to make sure that the ELSP tracker is similarly empty
> so that we do not feed back in the completed requests upon recovering
> from the reset.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 36 ++++++++++++++++++++++++++----------
>  1 file changed, 26 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 3c83f2dd6798..ad61d1998fb7 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1327,6 +1327,31 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>  {
>  	struct execlist_port *port = engine->execlist_port;
>  	struct intel_context *ce;
> +	unsigned int n;
> +
> +	/*
> +	 * Catch up with any missed context-switch interrupts.
> +	 *
> +	 * Ideally we would just read the remaining CSB entries now that we
> +	 * know the gpu is idle. However, the CSB registers are sometimes^W
> +	 * often trashed across a GPU reset! Instead we have to rely on
> +	 * guessing the missed context-switch events by looking at what
> +	 * requests were completed.
> +	 */
> +	if (!request) {
> +		for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++)

Don't you need to check for NULL before the put here?
-Mika

> +			i915_gem_request_put(port_request(&port[n]));
> +		memset(engine->execlist_port, 0, sizeof(engine->execlist_port));
> +		return;
> +	}
> +
> +	if (request->ctx != port_request(port)->ctx) {
> +		i915_gem_request_put(port_request(port));
> +		port[0] = port[1];
> +		memset(&port[1], 0, sizeof(port[1]));
> +	}
> +
> +	GEM_BUG_ON(request->ctx != port_request(port)->ctx);
>  
>  	/* If the request was innocent, we leave the request in the ELSP
>  	 * and will try to replay it on restarting. The context image may
> @@ -1338,7 +1363,7 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>  	 * and have to at least restore the RING register in the context
>  	 * image back to the expected values to skip over the guilty request.
>  	 */
> -	if (!request || request->fence.error != -EIO)
> +	if (request->fence.error != -EIO)
>  		return;
>  
>  	/* We want a simple context + ring to execute the breadcrumb update.
> @@ -1360,15 +1385,6 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>  	request->ring->head = request->postfix;
>  	intel_ring_update_space(request->ring);
>  
> -	/* Catch up with any missed context-switch interrupts */
> -	if (request->ctx != port_request(port)->ctx) {
> -		i915_gem_request_put(port_request(port));
> -		port[0] = port[1];
> -		memset(&port[1], 0, sizeof(port[1]));
> -	}
> -
> -	GEM_BUG_ON(request->ctx != port_request(port)->ctx);
> -
>  	/* Reset WaIdleLiteRestore:bdw,skl as well */
>  	request->tail =
>  		intel_ring_wrap(request->ring,
> -- 
> 2.13.2

* Re: [PATCH 06/15] drm/i915: Check the execlist queue for pending requests before declaring idle
  2017-07-17  9:11 ` [PATCH 06/15] drm/i915: Check the execlist queue for pending requests before declaring idle Chris Wilson
@ 2017-07-20 12:23   ` Mika Kuoppala
  0 siblings, 0 replies; 33+ messages in thread
From: Mika Kuoppala @ 2017-07-20 12:23 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Include a check against the execlist queue before declaring the engine
> idle and letting it pass hangcheck.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

> ---
>  drivers/gpu/drm/i915/intel_engine_cs.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index cba120f3d89d..fbac94557ffa 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -1407,6 +1407,10 @@ bool intel_engine_is_idle(struct intel_engine_cs *engine)
>  	if (port_request(&engine->execlist_port[0]))
>  		return false;
>  
> +	/* ELSP is empty, but there are ready requests? */
> +	if (READ_ONCE(engine->execlist_first))
> +		return false;
> +
>  	/* Ring stopped? */
>  	if (!ring_is_idle(engine))
>  		return false;
> -- 
> 2.13.2

* Re: [PATCH 04/15] drm/i915: Flush the execlist ports if idle
  2017-07-20 12:18   ` Mika Kuoppala
@ 2017-07-20 12:28     ` Chris Wilson
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2017-07-20 12:28 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2017-07-20 13:18:40)
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 3c83f2dd6798..ad61d1998fb7 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -1327,6 +1327,31 @@ static void reset_common_ring(struct intel_engine_cs *engine,
> >  {
> >       struct execlist_port *port = engine->execlist_port;
> >       struct intel_context *ce;
> > +     unsigned int n;
> > +
> > +     /*
> > +      * Catch up with any missed context-switch interrupts.
> > +      *
> > +      * Ideally we would just read the remaining CSB entries now that we
> > +      * know the gpu is idle. However, the CSB registers are sometimes^W
> > +      * often trashed across a GPU reset! Instead we have to rely on
> > +      * guessing the missed context-switch events by looking at what
> > +      * requests were completed.
> > +      */
> > +     if (!request) {
> > +             for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++)
> 
> You need to check against null before put in here?

dma_fence_put and i915_gem_request_put, by extension, are NULL-safe.
-Chris

* Re: [PATCH 10/15] drm/i915: Assert that machine is wedged for nop_submit_request
  2017-07-18  0:15   ` Michel Thierry
@ 2017-07-20 12:51     ` Chris Wilson
  2017-07-20 17:10       ` Michel Thierry
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2017-07-20 12:51 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx

Quoting Michel Thierry (2017-07-18 01:15:00)
> On 17/07/17 02:11, Chris Wilson wrote:
> > We should only ever do nop_submit_request when the machine is wedged, so
> > assert it is so.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >   drivers/gpu/drm/i915/i915_gem.c | 1 +
> >   1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index ca7a56ff3904..5517373c1bea 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -3089,6 +3089,7 @@ void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
> >   
> >   static void nop_submit_request(struct drm_i915_gem_request *request)
> >   {
> > +     GEM_BUG_ON(!i915_terminally_wedged(&request->i915->gpu_error));
> >       dma_fence_set_error(&request->fence, -EIO);
> >       i915_gem_request_submit(request);
> >       intel_engine_init_global_seqno(request->engine, request->global_seqno);
> > 
> 
> Not sure about this, your patch before this one, ("[PATCH 09/15] 
> drm/i915: Wake up waiters after setting the WEDGED bit"), is moving the 
> set_bit after engine_set_wedged:

But we are inside a stop_machine()...

> @@ -3157,10 +3157,12 @@ static int __i915_gem_set_wedged_BKL(void *data)
>         struct intel_engine_cs *engine;
>         enum intel_engine_id id;
> 
> -       set_bit(I915_WEDGED, &i915->gpu_error.flags);
>         for_each_engine(engine, i915, id)
>                 engine_set_wedged(engine);
> 
> +       set_bit(I915_WEDGED, &i915->gpu_error.flags);
> +       wake_up_all(&i915->gpu_error.reset_queue);
> +
>         return 0;
>   }
> 
> I don't think it won't hit the BUG_ON because i915_gem_set_wedged is 
> already protected by stop_machine. Anyway it looks odd.

Ah, I did consider it in passing, then rationalised it away because the
stop_machine() gives us the guarantee that we won't run
nop_submit_request() until after the wedging.
-Chris

* Re: [PATCH 13/15] drm/i915: Emit a user level message when resetting the GPU (or engine)
  2017-07-18  0:22   ` Michel Thierry
@ 2017-07-20 12:52     ` Chris Wilson
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Wilson @ 2017-07-20 12:52 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx

Quoting Michel Thierry (2017-07-18 01:22:28)
> On 17/07/17 02:11, Chris Wilson wrote:
> > Although a banned context will be told to -EIO off if it tries to submit
> > more requests, we have a discrepancy between whole device resets and
> > per-engine resets where we report the GPU reset but not the engine
> > resets. This leaves a bit of mystery as to why the context was banned,
> > and also reduces awareness overall of when a GPU (engine) reset occurs
> > with its possible side-effects.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Michel Thierry <michel.thierry@intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_drv.c | 8 +++++---
> >   1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > index bc121a46ed9a..4b62fd012877 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.c
> > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > @@ -1865,9 +1865,10 @@ void i915_reset(struct drm_i915_private *dev_priv)
> >       if (!i915_gem_unset_wedged(dev_priv))
> >               goto wakeup;
> >   
> > +     dev_notice(dev_priv->drm.dev,
> > +                "Resetting chip after gpu hang\n");
> >       error->reset_count++;
> >   
> > -     pr_notice("drm/i915: Resetting chip after gpu hang\n");
> >       disable_irq(dev_priv->drm.irq);
> >       ret = i915_gem_reset_prepare(dev_priv);
> >       if (ret) {
> > @@ -1945,7 +1946,9 @@ int i915_reset_engine(struct intel_engine_cs *engine)
> >   
> >       GEM_BUG_ON(!test_bit(I915_RESET_ENGINE + engine->id, &error->flags));
> >   
> > -     DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
> > +     dev_notice(engine->i915->drm.dev,
> > +                "Resetting %s after gpu hang\n", engine->name);
> > +     error->reset_engine_count[engine->id]++;
> >   
> 
> This will increment both the engine-reset-count and gpu-reset count in 
> the unlikely case that engine-reset gets promoted to full reset.
> 
> Not a problem per-se, but I wanted to point it out (plus it makes both 
> functions symmetric).

I felt it was justified as then we always increment either counter on
every attempt, not just success, which was the behaviour for the global
counter. I guess I should split that out since it is unrelated.
-Chris

* Re: [PATCH 04/15] drm/i915: Flush the execlist ports if idle
  2017-07-17  9:11 ` [PATCH 04/15] drm/i915: Flush the execlist ports if idle Chris Wilson
  2017-07-20 12:18   ` Mika Kuoppala
@ 2017-07-20 13:12   ` Mika Kuoppala
  1 sibling, 0 replies; 33+ messages in thread
From: Mika Kuoppala @ 2017-07-20 13:12 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> When doing a GPU reset, the CSB register will be trashed and we will
> lose any context-switch notifications that happened since the tasklet
> was disabled. If we find that all requests on this engine were
> completed, we want to make sure that the ELSP tracker is similarly empty
> so that we do not feed back in the completed requests upon recovering
> from the reset.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 36 ++++++++++++++++++++++++++----------
>  1 file changed, 26 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 3c83f2dd6798..ad61d1998fb7 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1327,6 +1327,31 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>  {
>  	struct execlist_port *port = engine->execlist_port;
>  	struct intel_context *ce;
> +	unsigned int n;
> +
> +	/*
> +	 * Catch up with any missed context-switch interrupts.
> +	 *
> +	 * Ideally we would just read the remaining CSB entries now that we
> +	 * know the gpu is idle. However, the CSB registers are sometimes^W
> +	 * often trashed across a GPU reset! Instead we have to rely on
> +	 * guessing the missed context-switch events by looking at what
> +	 * requests were completed.
> +	 */
> +	if (!request) {
> +		for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++)
> +			i915_gem_request_put(port_request(&port[n]));
> +		memset(engine->execlist_port, 0, sizeof(engine->execlist_port));
> +		return;
> +	}
> +
> +	if (request->ctx != port_request(port)->ctx) {
> +		i915_gem_request_put(port_request(port));
> +		port[0] = port[1];
> +		memset(&port[1], 0, sizeof(port[1]));
> +	}
> +
> +	GEM_BUG_ON(request->ctx != port_request(port)->ctx);
>  
>  	/* If the request was innocent, we leave the request in the ELSP
>  	 * and will try to replay it on restarting. The context image may
> @@ -1338,7 +1363,7 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>  	 * and have to at least restore the RING register in the context
>  	 * image back to the expected values to skip over the guilty request.
>  	 */
> -	if (!request || request->fence.error != -EIO)
> +	if (request->fence.error != -EIO)
>  		return;
>  
>  	/* We want a simple context + ring to execute the breadcrumb update.
> @@ -1360,15 +1385,6 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>  	request->ring->head = request->postfix;
>  	intel_ring_update_space(request->ring);
>  
> -	/* Catch up with any missed context-switch interrupts */
> -	if (request->ctx != port_request(port)->ctx) {
> -		i915_gem_request_put(port_request(port));
> -		port[0] = port[1];
> -		memset(&port[1], 0, sizeof(port[1]));
> -	}
> -
> -	GEM_BUG_ON(request->ctx != port_request(port)->ctx);
> -
>  	/* Reset WaIdleLiteRestore:bdw,skl as well */
>  	request->tail =
>  		intel_ring_wrap(request->ring,
> -- 
> 2.13.2

* Re: [PATCH 07/15] drm/i915: Move idle checks before intel_engine_init_global_seqno()
  2017-07-17  9:11 ` [PATCH 07/15] drm/i915: Move idle checks before intel_engine_init_global_seqno() Chris Wilson
@ 2017-07-20 13:16   ` Mika Kuoppala
  0 siblings, 0 replies; 33+ messages in thread
From: Mika Kuoppala @ 2017-07-20 13:16 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> intel_engine_init_global_seqno() may be called from an uncontrolled
> set-wedged path where we have given up waiting for broken hw and declare
> it defunct. Along that path, any sanity checks that the hw is idle
> before we adjust its state are expected to fail, so we simply cannot
> keep them there.
> Instead of asserting inside init_global_seqno, we move them to the
> normal caller reset_all_global_seqno() as it handles runtime seqno
> wraparound.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

> ---
>  drivers/gpu/drm/i915/i915_gem_request.c | 4 ++++
>  drivers/gpu/drm/i915/intel_engine_cs.c  | 3 ---
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 483af8921060..d93a185c0f0a 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -213,6 +213,10 @@ static int reset_all_global_seqno(struct drm_i915_private *i915, u32 seqno)
>  				cond_resched();
>  		}
>  
> +		/* Check we are idle before we fiddle with hw state! */
> +		GEM_BUG_ON(!intel_engine_is_idle(engine));
> +		GEM_BUG_ON(i915_gem_active_isset(&engine->timeline->last_request));
> +
>  		/* Finally reset hw state */
>  		intel_engine_init_global_seqno(engine, seqno);
>  		tl->seqno = seqno;
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index fbac94557ffa..19297e6aab41 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -337,9 +337,6 @@ void intel_engine_init_global_seqno(struct intel_engine_cs *engine, u32 seqno)
>  {
>  	struct drm_i915_private *dev_priv = engine->i915;
>  
> -	GEM_BUG_ON(!intel_engine_is_idle(engine));
> -	GEM_BUG_ON(i915_gem_active_isset(&engine->timeline->last_request));
> -
>  	/* Our semaphore implementation is strictly monotonic (i.e. we proceed
>  	 * so long as the semaphore value in the register/page is greater
>  	 * than the sync value), so whenever we reset the seqno,
> -- 
> 2.13.2

* Re: [PATCH 08/15] drm/i915: Clear execlist port[] before updating seqno on wedging
  2017-07-17  9:11 ` [PATCH 08/15] drm/i915: Clear execlist port[] before updating seqno on wedging Chris Wilson
@ 2017-07-20 13:31   ` Mika Kuoppala
  2017-07-20 13:54     ` Chris Wilson
  0 siblings, 1 reply; 33+ messages in thread
From: Mika Kuoppala @ 2017-07-20 13:31 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> When we wedge the device, we clear out the in-flight requests and
> advance the breadcrumb to indicate they are complete. However, the
> breadcrumb advance includes an assert that the engine is idle, so that
> advancement needs to be the last step to ensure we pass our own sanity
> checks.

I am confused about this one. The previous patch seems to make
the concern moot.

-Mika

>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 580b5042f4f7..40e94b4ef532 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3114,13 +3114,6 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
>  			dma_fence_set_error(&request->fence, -EIO);
>  	spin_unlock_irqrestore(&engine->timeline->lock, flags);
>  
> -	/* Mark all pending requests as complete so that any concurrent
> -	 * (lockless) lookup doesn't try and wait upon the request as we
> -	 * reset it.
> -	 */
> -	intel_engine_init_global_seqno(engine,
> -				       intel_engine_last_submit(engine));
> -
>  	/*
>  	 * Clear the execlists queue up before freeing the requests, as those
>  	 * are the ones that keep the context and ringbuffer backing objects
> @@ -3149,6 +3142,13 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
>  		 */
>  		clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
>  	}
> +
> +	/* Mark all pending requests as complete so that any concurrent
> +	 * (lockless) lookup doesn't try and wait upon the request as we
> +	 * reset it.
> +	 */
> +	intel_engine_init_global_seqno(engine,
> +				       intel_engine_last_submit(engine));
>  }
>  
>  static int __i915_gem_set_wedged_BKL(void *data)
> -- 
> 2.13.2

* Re: [PATCH 09/15] drm/i915: Wake up waiters after setting the WEDGED bit
  2017-07-17  9:11 ` [PATCH 09/15] drm/i915: Wake up waiters after setting the WEDGED bit Chris Wilson
@ 2017-07-20 13:48   ` Mika Kuoppala
  0 siblings, 0 replies; 33+ messages in thread
From: Mika Kuoppala @ 2017-07-20 13:48 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> After setting the WEDGED bit, make sure that we do wake up waiters as
> they may not be waiting for a request completion yet, just for its
> execution.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

> ---
>  drivers/gpu/drm/i915/i915_gem.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 40e94b4ef532..ca7a56ff3904 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3157,10 +3157,12 @@ static int __i915_gem_set_wedged_BKL(void *data)
>  	struct intel_engine_cs *engine;
>  	enum intel_engine_id id;
>  
> -	set_bit(I915_WEDGED, &i915->gpu_error.flags);
>  	for_each_engine(engine, i915, id)
>  		engine_set_wedged(engine);
>  
> +	set_bit(I915_WEDGED, &i915->gpu_error.flags);
> +	wake_up_all(&i915->gpu_error.reset_queue);
> +
>  	return 0;
>  }
>  
> -- 
> 2.13.2

* Re: [PATCH 08/15] drm/i915: Clear execlist port[] before updating seqno on wedging
  2017-07-20 13:31   ` Mika Kuoppala
@ 2017-07-20 13:54     ` Chris Wilson
  2017-07-20 14:14       ` Mika Kuoppala
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Wilson @ 2017-07-20 13:54 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2017-07-20 14:31:31)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > When we wedge the device, we clear out the in-flight requests and
> > advance the breadcrumb to indicate they are complete. However, the
> > breadcrumb advance includes an assert that the engine is idle, so that
> > advancement needs to be the last step to ensure we pass our own sanity
> > checks.
> 
> I am confused about this one. The previous patch seems to make
> the concern void.

Yeah, I moved the assert around, but still felt the order imposed by
the memory of that assert was better.
-Chris

* Re: [PATCH 08/15] drm/i915: Clear execlist port[] before updating seqno on wedging
  2017-07-20 13:54     ` Chris Wilson
@ 2017-07-20 14:14       ` Mika Kuoppala
  0 siblings, 0 replies; 33+ messages in thread
From: Mika Kuoppala @ 2017-07-20 14:14 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2017-07-20 14:31:31)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > When we wedge the device, we clear out the in-flight requests and
>> > advance the breadcrumb to indicate they are complete. However, the
>> > breadcrumb advance includes an assert that the engine is idle, so that
>> > advancement needs to be the last step to ensure we pass our own sanity
>> > checks.
>> 
>> I am confused about this one. The previous patch seems to make
>> the concern void.
>
> Yeah, I moved the assert around, but still felt the order imposed by
> the memory of that assert was better.

Agreed that the ordering is better.

If you swap these two patches around, you don't have to change the commit
message. Otherwise, fix the false notion that there is an assert in this path.

Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

* Re: [PATCH 10/15] drm/i915: Assert that machine is wedged for nop_submit_request
  2017-07-20 12:51     ` Chris Wilson
@ 2017-07-20 17:10       ` Michel Thierry
  0 siblings, 0 replies; 33+ messages in thread
From: Michel Thierry @ 2017-07-20 17:10 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 7/20/2017 5:51 AM, Chris Wilson wrote:
> Quoting Michel Thierry (2017-07-18 01:15:00)
>> On 17/07/17 02:11, Chris Wilson wrote:
>>> We should only ever do nop_submit_request when the machine is wedged, so
>>> assert it is so.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> ---
>>>    drivers/gpu/drm/i915/i915_gem.c | 1 +
>>>    1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index ca7a56ff3904..5517373c1bea 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -3089,6 +3089,7 @@ void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
>>>    
>>>    static void nop_submit_request(struct drm_i915_gem_request *request)
>>>    {
>>> +     GEM_BUG_ON(!i915_terminally_wedged(&request->i915->gpu_error));
>>>        dma_fence_set_error(&request->fence, -EIO);
>>>        i915_gem_request_submit(request);
>>>        intel_engine_init_global_seqno(request->engine, request->global_seqno);
>>>
>>
>> Not sure about this, your patch before this one, ("[PATCH 09/15]
>> drm/i915: Wake up waiters after setting the WEDGED bit"), is moving the
>> set_bit after engine_set_wedged:
> 
> But we are inside a stop_machine()...
> 
>> @@ -3157,10 +3157,12 @@ static int __i915_gem_set_wedged_BKL(void *data)
>>          struct intel_engine_cs *engine;
>>          enum intel_engine_id id;
>>
>> -       set_bit(I915_WEDGED, &i915->gpu_error.flags);
>>          for_each_engine(engine, i915, id)
>>                  engine_set_wedged(engine);
>>
>> +       set_bit(I915_WEDGED, &i915->gpu_error.flags);
>> +       wake_up_all(&i915->gpu_error.reset_queue);
>> +
>>          return 0;
>>    }
>>
>> I don't think it won't hit the BUG_ON because i915_gem_set_wedged is
>> already protected by stop_machine. Anyway it looks odd.
> 
> Ah, I did consider it in passing, then rationalised it away because the
> stop_machine() gives us the guarantee that we won't run
> nop_submit_request() until after the wedging.

And the _BKL in the function name implies it too.

(btw my sentence above should have been "I don't think it will hit" or 
"I think it won't hit", but luckily you got it right)

Reviewed-by: Michel Thierry <michel.thierry@intel.com>

> -Chris
> 

Thread overview: 33+ messages
2017-07-17  9:11 [PATCH 01/15] drm/i915: Report execlists irq bit in debugfs Chris Wilson
2017-07-17  9:11 ` [PATCH 02/15] drm/i915: Reset context image on engines after triggering the reset Chris Wilson
2017-07-17  9:11 ` [PATCH 03/15] drm/i915: Serialize per-engine resets against new requests Chris Wilson
2017-07-17 21:58   ` Michel Thierry
2017-07-17  9:11 ` [PATCH 04/15] drm/i915: Flush the execlist ports if idle Chris Wilson
2017-07-20 12:18   ` Mika Kuoppala
2017-07-20 12:28     ` Chris Wilson
2017-07-20 13:12   ` Mika Kuoppala
2017-07-17  9:11 ` [PATCH 05/15] drm/i915: Check execlist/ring status during hangcheck Chris Wilson
2017-07-17 23:11   ` Michel Thierry
2017-07-17  9:11 ` [PATCH 06/15] drm/i915: Check the execlist queue for pending requests before declaring idle Chris Wilson
2017-07-20 12:23   ` Mika Kuoppala
2017-07-17  9:11 ` [PATCH 07/15] drm/i915: Move idle checks before intel_engine_init_global_seqno() Chris Wilson
2017-07-20 13:16   ` Mika Kuoppala
2017-07-17  9:11 ` [PATCH 08/15] drm/i915: Clear execlist port[] before updating seqno on wedging Chris Wilson
2017-07-20 13:31   ` Mika Kuoppala
2017-07-20 13:54     ` Chris Wilson
2017-07-20 14:14       ` Mika Kuoppala
2017-07-17  9:11 ` [PATCH 09/15] drm/i915: Wake up waiters after setting the WEDGED bit Chris Wilson
2017-07-20 13:48   ` Mika Kuoppala
2017-07-17  9:11 ` [PATCH 10/15] drm/i915: Assert that machine is wedged for nop_submit_request Chris Wilson
2017-07-18  0:15   ` Michel Thierry
2017-07-20 12:51     ` Chris Wilson
2017-07-20 17:10       ` Michel Thierry
2017-07-17  9:11 ` [PATCH 11/15] drm/i915: Clear engine irq posted following a reset Chris Wilson
2017-07-17 22:05   ` Michel Thierry
2017-07-17  9:11 ` [PATCH 12/15] drm/i915: Make i915_gem_context_mark_guilty() safe for unlocked updates Chris Wilson
2017-07-17  9:11 ` [PATCH 13/15] drm/i915: Emit a user level message when resetting the GPU (or engine) Chris Wilson
2017-07-18  0:22   ` Michel Thierry
2017-07-20 12:52     ` Chris Wilson
2017-07-17  9:11 ` [PATCH 14/15] drm/i915: Disable per-engine reset for Broxton Chris Wilson
2017-07-17 23:57   ` Michel Thierry
2017-07-17  9:11 ` [PATCH 15/15] drm/i915/selftests: Exercise independence of per-engine resets Chris Wilson
