[PATCH 1/4] drm/i915: Stop allowing the device to be unwedged

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/4] drm/i915: Stop allowing the device to be unwedged
@ 2017-03-08 11:40 ` Chris Wilson
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Wilson @ 2017-03-08 11:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Tvrtko Ursulin, Mika Kuoppala, # v4 . 9+

Since commit 821ed7df6e2a ("drm/i915: Update reset path to fix
incomplete requests"), setting the device as wedged is permanent as we
cannot recover the engine->submit_request. Stop clearing the I915_WEDGED
status to prevent userspace can getting itself in a muddle.

To fix this correctly, we need to stop overriding engine->submit_request
for the inflight requests and instead need to track the errors in
flight. In the meantime, let's start with the correctness fix.

Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: <stable@vger.kernel.org> # v4.9+
---
 drivers/gpu/drm/i915/i915_drv.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index b1e9027a4f80..1c4f0a21eb22 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1824,8 +1824,11 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	if (!test_and_clear_bit(I915_RESET_IN_PROGRESS, &error->flags))
 		return;
 
-	/* Clear any previous failed attempts at recovery. Time to try again. */
-	__clear_bit(I915_WEDGED, &error->flags);
+	if (test_bit(I915_WEDGED, &error->flags)) {
+		wake_up_bit(&error->flags, I915_RESET_IN_PROGRESS);
+		goto out;
+	}
+
 	error->reset_count++;
 
 	pr_notice("drm/i915: Resetting chip after gpu hang\n");
@@ -1874,6 +1877,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
 wakeup:
 	i915_gem_reset_finish(dev_priv);
 	enable_irq(dev_priv->drm.irq);
+out:
 	wake_up_bit(&error->flags, I915_RESET_IN_PROGRESS);
 	return;
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 1/4] drm/i915: Stop allowing the device to be unwedged
@ 2017-03-08 11:40 ` Chris Wilson
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Wilson @ 2017-03-08 11:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala, # v4 . 9+

Since commit 821ed7df6e2a ("drm/i915: Update reset path to fix
incomplete requests"), setting the device as wedged is permanent as we
cannot recover the engine->submit_request. Stop clearing the I915_WEDGED
status to prevent userspace can getting itself in a muddle.

To fix this correctly, we need to stop overriding engine->submit_request
for the inflight requests and instead need to track the errors in
flight. In the meantime, let's start with the correctness fix.

Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: <stable@vger.kernel.org> # v4.9+
---
 drivers/gpu/drm/i915/i915_drv.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index b1e9027a4f80..1c4f0a21eb22 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1824,8 +1824,11 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	if (!test_and_clear_bit(I915_RESET_IN_PROGRESS, &error->flags))
 		return;
 
-	/* Clear any previous failed attempts at recovery. Time to try again. */
-	__clear_bit(I915_WEDGED, &error->flags);
+	if (test_bit(I915_WEDGED, &error->flags)) {
+		wake_up_bit(&error->flags, I915_RESET_IN_PROGRESS);
+		goto out;
+	}
+
 	error->reset_count++;
 
 	pr_notice("drm/i915: Resetting chip after gpu hang\n");
@@ -1874,6 +1877,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
 wakeup:
 	i915_gem_reset_finish(dev_priv);
 	enable_irq(dev_priv->drm.irq);
+out:
 	wake_up_bit(&error->flags, I915_RESET_IN_PROGRESS);
 	return;
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/4] drm/i915: Track the error through the sw-fence
  2017-03-08 11:40 ` Chris Wilson
  (?)
@ 2017-03-08 11:40 ` Chris Wilson
  -1 siblings, 0 replies; 12+ messages in thread
From: Chris Wilson @ 2017-03-08 11:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

In order to handle asynchronous error conditions, we need to store the
error status on the fence itself, and inspect it upon signaling.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_sw_fence.c | 25 ++++++++++++++++++++-----
 drivers/gpu/drm/i915/i915_sw_fence.h |  1 +
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_sw_fence.c b/drivers/gpu/drm/i915/i915_sw_fence.c
index a277f8eb7beb..49043687effc 100644
--- a/drivers/gpu/drm/i915/i915_sw_fence.c
+++ b/drivers/gpu/drm/i915/i915_sw_fence.c
@@ -179,7 +179,7 @@ static void __i915_sw_fence_wake_up_all(struct i915_sw_fence *fence,
 		do {
 			list_for_each_entry_safe(pos, next,
 						 &x->task_list, task_list)
-				pos->func(pos, TASK_NORMAL, 0, &extra);
+				pos->func(pos, TASK_NORMAL, fence->error, &extra);
 
 			if (list_empty(&extra))
 				break;
@@ -240,6 +240,7 @@ void __i915_sw_fence_init(struct i915_sw_fence *fence,
 	kref_init(&fence->kref);
 	atomic_set(&fence->pending, 1);
 	fence->flags = (unsigned long)fn;
+	fence->error = 0;
 }
 
 static void __i915_sw_fence_commit(struct i915_sw_fence *fence)
@@ -254,13 +255,21 @@ void i915_sw_fence_commit(struct i915_sw_fence *fence)
 	__i915_sw_fence_commit(fence);
 }
 
-static int i915_sw_fence_wake(wait_queue_t *wq, unsigned mode, int flags, void *key)
+static int i915_sw_fence_wake(wait_queue_t *wq, unsigned mode, int err, void *key)
 {
+	struct i915_sw_fence *fence = wq->private;
+
 	list_del(&wq->task_list);
-	__i915_sw_fence_complete(wq->private, key);
-	i915_sw_fence_put(wq->private);
+
+	if (err && !fence->error)
+		fence->error = err;
+
+	__i915_sw_fence_complete(fence, key);
+	i915_sw_fence_put(fence);
+
 	if (wq->flags & I915_SW_FENCE_FLAG_ALLOC)
 		kfree(wq);
+
 	return 0;
 }
 
@@ -402,6 +411,9 @@ static void timer_i915_sw_fence_wake(unsigned long data)
 	dma_fence_put(cb->dma);
 	cb->dma = NULL;
 
+	if (!cb->fence->error)
+		cb->fence->error = -ETIMEDOUT;
+
 	__i915_sw_fence_commit(cb->fence);
 	cb->timer.function = NULL;
 }
@@ -412,8 +424,11 @@ static void dma_i915_sw_fence_wake(struct dma_fence *dma,
 	struct i915_sw_dma_fence_cb *cb = container_of(data, typeof(*cb), base);
 
 	del_timer_sync(&cb->timer);
-	if (cb->timer.function)
+	if (cb->timer.function) {
+		if (dma->error && !cb->fence->error)
+			cb->fence->error = dma->error;
 		__i915_sw_fence_commit(cb->fence);
+	}
 	dma_fence_put(cb->dma);
 
 	kfree(cb);
diff --git a/drivers/gpu/drm/i915/i915_sw_fence.h b/drivers/gpu/drm/i915/i915_sw_fence.h
index d31cefbbcc04..b543ef2ad225 100644
--- a/drivers/gpu/drm/i915/i915_sw_fence.h
+++ b/drivers/gpu/drm/i915/i915_sw_fence.h
@@ -25,6 +25,7 @@ struct i915_sw_fence {
 	unsigned long flags;
 	struct kref kref;
 	atomic_t pending;
+	int error;
 };
 
 #define I915_SW_FENCE_CHECKED_BIT	0 /* used internally for DAG checking */
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/4] drm/i915: Skip cancelled requests in flight
  2017-03-08 11:40 ` Chris Wilson
  (?)
  (?)
@ 2017-03-08 11:40 ` Chris Wilson
  2017-03-08 11:49   ` Chris Wilson
  2017-03-08 12:14   ` [PATCH v2] " Chris Wilson
  -1 siblings, 2 replies; 12+ messages in thread
From: Chris Wilson @ 2017-03-08 11:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

Check the error of the fence upon submission and skip the request
(clearing out the dispatch and only processing the breadcrumb update) if
we are propagating an earlier failure. This means that we cancel
requests inflight without having to override the engine->submit_request
callback, eliminating the need to stop the machine to do so and allows
us to recover afterwards.

The caveat is that we do depend upon error propagation along the
dependency chains to skip pending requests following the wedged contexts.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c         | 44 +++------------------------------
 drivers/gpu/drm/i915/i915_gem_request.c | 21 ++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_request.h |  2 ++
 3 files changed, 27 insertions(+), 40 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index e29f9400c9d1..d2e2dc19bc8f 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2778,20 +2778,7 @@ int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
 
 static void skip_request(struct drm_i915_gem_request *request)
 {
-	void *vaddr = request->ring->vaddr;
-	u32 head;
-
-	/* As this request likely depends on state from the lost
-	 * context, clear out all the user operations leaving the
-	 * breadcrumb at the end (so we get the fence notifications).
-	 */
-	head = request->head;
-	if (request->postfix < head) {
-		memset(vaddr + head, 0, request->ring->size - head);
-		head = 0;
-	}
-	memset(vaddr + head, 0, request->postfix - head);
-
+	i915_gem_request_skip(request);
 	dma_fence_set_error(&request->fence, -EIO);
 }
 
@@ -2915,26 +2902,11 @@ void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
 	}
 }
 
-static void nop_submit_request(struct drm_i915_gem_request *request)
-{
-	dma_fence_set_error(&request->fence, -EIO);
-	i915_gem_request_submit(request);
-	intel_engine_init_global_seqno(request->engine, request->global_seqno);
-}
-
 static void engine_set_wedged(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *request;
 	unsigned long flags;
 
-	/* We need to be sure that no thread is running the old callback as
-	 * we install the nop handler (otherwise we would submit a request
-	 * to hardware that will never complete). In order to prevent this
-	 * race, we wait until the machine is idle before making the swap
-	 * (using stop_machine()).
-	 */
-	engine->submit_request = nop_submit_request;
-
 	/* Mark all executing requests as skipped */
 	spin_lock_irqsave(&engine->timeline->lock, flags);
 	list_for_each_entry(request, &engine->timeline->requests, link)
@@ -2969,24 +2941,16 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
 	}
 }
 
-static int __i915_gem_set_wedged_BKL(void *data)
+void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
 {
-	struct drm_i915_private *i915 = data;
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 
-	for_each_engine(engine, i915, id)
-		engine_set_wedged(engine);
-
-	return 0;
-}
-
-void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
-{
 	lockdep_assert_held(&dev_priv->drm.struct_mutex);
 	set_bit(I915_WEDGED, &dev_priv->gpu_error.flags);
 
-	stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
+	for_each_engine(engine, dev_priv, id)
+		engine_set_wedged(engine);
 
 	i915_gem_context_lost(dev_priv);
 	i915_gem_retire_requests(dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 1e1d9f2072cd..e76c6e83a794 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -477,6 +477,23 @@ void i915_gem_request_unsubmit(struct drm_i915_gem_request *request)
 	spin_unlock_irqrestore(&engine->timeline->lock, flags);
 }
 
+void i915_gem_request_skip(struct drm_i915_gem_request *request)
+{
+	void *vaddr = request->ring->vaddr;
+	u32 head;
+
+	/* As this request likely depends on state from the lost
+	 * context, clear out all the user operations leaving the
+	 * breadcrumb at the end (so we get the fence notifications).
+	 */
+	head = request->head;
+	if (request->postfix < head) {
+		memset(vaddr + head, 0, request->ring->size - head);
+		head = 0;
+	}
+	memset(vaddr + head, 0, request->postfix - head);
+}
+
 static int __i915_sw_fence_call
 submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 {
@@ -486,6 +503,10 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 	switch (state) {
 	case FENCE_COMPLETE:
 		trace_i915_gem_request_submit(request);
+		if (unlikely(fence->error == -EIO)) {
+			i915_gem_request_skip(request);
+			dma_fence_set_error(&request->fence, fence->error);
+		}
 		request->engine->submit_request(request);
 		break;
 
diff --git a/drivers/gpu/drm/i915/i915_gem_request.h b/drivers/gpu/drm/i915/i915_gem_request.h
index 5018e55922f0..6fdfb801bcee 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.h
+++ b/drivers/gpu/drm/i915/i915_gem_request.h
@@ -352,6 +352,8 @@ static inline bool i915_spin_request(const struct drm_i915_gem_request *request,
 		__i915_spin_request(request, seqno, state, timeout_us));
 }
 
+void i915_gem_request_skip(struct drm_i915_gem_request *request);
+
 /* We treat requests as fences. This is not be to confused with our
  * "fence registers" but pipeline synchronisation objects ala GL_ARB_sync.
  * We use the fences to synchronize access from the CPU with activity on the
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 4/4] Revert "drm/i915: Stop allowing the device to be unwedged"
  2017-03-08 11:40 ` Chris Wilson
                   ` (2 preceding siblings ...)
  (?)
@ 2017-03-08 11:40 ` Chris Wilson
  -1 siblings, 0 replies; 12+ messages in thread
From: Chris Wilson @ 2017-03-08 11:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

This reverts commit 77705ceb84161c9f5074200460fd6cb26b6a0f93.

With inflight request cancellation now used to track the broken
requests, we can restore the ability to unwedge the machine for igt.

Testcase: igt/gem_eio
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com
---
 drivers/gpu/drm/i915/i915_drv.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 1c4f0a21eb22..b1e9027a4f80 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1824,11 +1824,8 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	if (!test_and_clear_bit(I915_RESET_IN_PROGRESS, &error->flags))
 		return;
 
-	if (test_bit(I915_WEDGED, &error->flags)) {
-		wake_up_bit(&error->flags, I915_RESET_IN_PROGRESS);
-		goto out;
-	}
-
+	/* Clear any previous failed attempts at recovery. Time to try again. */
+	__clear_bit(I915_WEDGED, &error->flags);
 	error->reset_count++;
 
 	pr_notice("drm/i915: Resetting chip after gpu hang\n");
@@ -1877,7 +1874,6 @@ void i915_reset(struct drm_i915_private *dev_priv)
 wakeup:
 	i915_gem_reset_finish(dev_priv);
 	enable_irq(dev_priv->drm.irq);
-out:
 	wake_up_bit(&error->flags, I915_RESET_IN_PROGRESS);
 	return;
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] drm/i915: Stop allowing the device to be unwedged
  2017-03-08 11:40 ` Chris Wilson
@ 2017-03-08 11:45   ` Chris Wilson
  -1 siblings, 0 replies; 12+ messages in thread
From: Chris Wilson @ 2017-03-08 11:45 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tvrtko Ursulin, Mika Kuoppala, # v4 . 9+

On Wed, Mar 08, 2017 at 11:40:07AM +0000, Chris Wilson wrote:
> Since commit 821ed7df6e2a ("drm/i915: Update reset path to fix
> incomplete requests"), setting the device as wedged is permanent as we
> cannot recover the engine->submit_request. Stop clearing the I915_WEDGED
> status to prevent userspace can getting itself in a muddle.
> 
> To fix this correctly, we need to stop overriding engine->submit_request
> for the inflight requests and instead need to track the errors in
> flight. In the meantime, let's start with the correctness fix.
> 
> Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: <stable@vger.kernel.org> # v4.9+
> ---
>  drivers/gpu/drm/i915/i915_drv.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index b1e9027a4f80..1c4f0a21eb22 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1824,8 +1824,11 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  	if (!test_and_clear_bit(I915_RESET_IN_PROGRESS, &error->flags))
>  		return;
>  
> -	/* Clear any previous failed attempts at recovery. Time to try again. */
> -	__clear_bit(I915_WEDGED, &error->flags);
> +	if (test_bit(I915_WEDGED, &error->flags)) {
> +		wake_up_bit(&error->flags, I915_RESET_IN_PROGRESS);
> +		goto out;

Double wakeup; just goto out;
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] drm/i915: Stop allowing the device to be unwedged
@ 2017-03-08 11:45   ` Chris Wilson
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Wilson @ 2017-03-08 11:45 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala, # v4 . 9+

On Wed, Mar 08, 2017 at 11:40:07AM +0000, Chris Wilson wrote:
> Since commit 821ed7df6e2a ("drm/i915: Update reset path to fix
> incomplete requests"), setting the device as wedged is permanent as we
> cannot recover the engine->submit_request. Stop clearing the I915_WEDGED
> status to prevent userspace can getting itself in a muddle.
> 
> To fix this correctly, we need to stop overriding engine->submit_request
> for the inflight requests and instead need to track the errors in
> flight. In the meantime, let's start with the correctness fix.
> 
> Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: <stable@vger.kernel.org> # v4.9+
> ---
>  drivers/gpu/drm/i915/i915_drv.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index b1e9027a4f80..1c4f0a21eb22 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1824,8 +1824,11 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  	if (!test_and_clear_bit(I915_RESET_IN_PROGRESS, &error->flags))
>  		return;
>  
> -	/* Clear any previous failed attempts at recovery. Time to try again. */
> -	__clear_bit(I915_WEDGED, &error->flags);
> +	if (test_bit(I915_WEDGED, &error->flags)) {
> +		wake_up_bit(&error->flags, I915_RESET_IN_PROGRESS);
> +		goto out;

Double wakeup; just goto out;
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/4] drm/i915: Skip cancelled requests in flight
  2017-03-08 11:40 ` [PATCH 3/4] drm/i915: Skip cancelled requests in flight Chris Wilson
@ 2017-03-08 11:49   ` Chris Wilson
  2017-03-08 12:14   ` [PATCH v2] " Chris Wilson
  1 sibling, 0 replies; 12+ messages in thread
From: Chris Wilson @ 2017-03-08 11:49 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

On Wed, Mar 08, 2017 at 11:40:09AM +0000, Chris Wilson wrote:
> @@ -486,6 +503,10 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
>  	switch (state) {
>  	case FENCE_COMPLETE:
>  		trace_i915_gem_request_submit(request);
> +		if (unlikely(fence->error == -EIO)) {
> +			i915_gem_request_skip(request);
> +			dma_fence_set_error(&request->fence, fence->error);
> +		}

Hmm. This expects the GPU to recover before third party fences complete.
First party completion will propagate the -EIO, but still we submit to a
wedged gpu.

Needs a
		if (i915_terminally_wedged(request->i915)) {
			dma_fence_set_error(-EIO);
			i915_gem_request_submit(request);
			intel_engine_init_global_seqno(request->engine, request->global_seqno);
		} else
>  		request->engine->submit_request(request);
>  		break;

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2] drm/i915: Skip cancelled requests in flight
  2017-03-08 11:40 ` [PATCH 3/4] drm/i915: Skip cancelled requests in flight Chris Wilson
  2017-03-08 11:49   ` Chris Wilson
@ 2017-03-08 12:14   ` Chris Wilson
  2017-03-08 15:38     ` Chris Wilson
  1 sibling, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2017-03-08 12:14 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

Check the error of the fence upon submission and skip the request
(clearing out the dispatch and only processing the breadcrumb update) if
we are propagating an earlier failure. This means that we cancel
requests inflight without having to override the engine->submit_request
callback, eliminating the need to stop the machine to do so and allows
us to recover afterwards.

The caveat is that we do depend upon error propagation along the
dependency chains to skip pending requests following the wedged contexts.

v2: Avoid submitting requests whilst the device is still wedged

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c         | 44 +++------------------------------
 drivers/gpu/drm/i915/i915_gem_request.c | 33 ++++++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_gem_request.h |  2 ++
 3 files changed, 38 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index e29f9400c9d1..d2e2dc19bc8f 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2778,20 +2778,7 @@ int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
 
 static void skip_request(struct drm_i915_gem_request *request)
 {
-	void *vaddr = request->ring->vaddr;
-	u32 head;
-
-	/* As this request likely depends on state from the lost
-	 * context, clear out all the user operations leaving the
-	 * breadcrumb at the end (so we get the fence notifications).
-	 */
-	head = request->head;
-	if (request->postfix < head) {
-		memset(vaddr + head, 0, request->ring->size - head);
-		head = 0;
-	}
-	memset(vaddr + head, 0, request->postfix - head);
-
+	i915_gem_request_skip(request);
 	dma_fence_set_error(&request->fence, -EIO);
 }
 
@@ -2915,26 +2902,11 @@ void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
 	}
 }
 
-static void nop_submit_request(struct drm_i915_gem_request *request)
-{
-	dma_fence_set_error(&request->fence, -EIO);
-	i915_gem_request_submit(request);
-	intel_engine_init_global_seqno(request->engine, request->global_seqno);
-}
-
 static void engine_set_wedged(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *request;
 	unsigned long flags;
 
-	/* We need to be sure that no thread is running the old callback as
-	 * we install the nop handler (otherwise we would submit a request
-	 * to hardware that will never complete). In order to prevent this
-	 * race, we wait until the machine is idle before making the swap
-	 * (using stop_machine()).
-	 */
-	engine->submit_request = nop_submit_request;
-
 	/* Mark all executing requests as skipped */
 	spin_lock_irqsave(&engine->timeline->lock, flags);
 	list_for_each_entry(request, &engine->timeline->requests, link)
@@ -2969,24 +2941,16 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
 	}
 }
 
-static int __i915_gem_set_wedged_BKL(void *data)
+void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
 {
-	struct drm_i915_private *i915 = data;
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 
-	for_each_engine(engine, i915, id)
-		engine_set_wedged(engine);
-
-	return 0;
-}
-
-void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
-{
 	lockdep_assert_held(&dev_priv->drm.struct_mutex);
 	set_bit(I915_WEDGED, &dev_priv->gpu_error.flags);
 
-	stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
+	for_each_engine(engine, dev_priv, id)
+		engine_set_wedged(engine);
 
 	i915_gem_context_lost(dev_priv);
 	i915_gem_retire_requests(dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 1e1d9f2072cd..1edee59e6820 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -477,6 +477,30 @@ void i915_gem_request_unsubmit(struct drm_i915_gem_request *request)
 	spin_unlock_irqrestore(&engine->timeline->lock, flags);
 }
 
+void i915_gem_request_skip(struct drm_i915_gem_request *request)
+{
+	void *vaddr = request->ring->vaddr;
+	u32 head;
+
+	/* As this request likely depends on state from the lost
+	 * context, clear out all the user operations leaving the
+	 * breadcrumb at the end (so we get the fence notifications).
+	 */
+	head = request->head;
+	if (request->postfix < head) {
+		memset(vaddr + head, 0, request->ring->size - head);
+		head = 0;
+	}
+	memset(vaddr + head, 0, request->postfix - head);
+}
+
+static void wedged_submit_request(struct drm_i915_gem_request *request)
+{
+	dma_fence_set_error(&request->fence, -EIO);
+	i915_gem_request_submit(request);
+	intel_engine_init_global_seqno(request->engine, request->global_seqno);
+}
+
 static int __i915_sw_fence_call
 submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 {
@@ -486,7 +510,14 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 	switch (state) {
 	case FENCE_COMPLETE:
 		trace_i915_gem_request_submit(request);
-		request->engine->submit_request(request);
+		if (unlikely(fence->error == -EIO)) {
+			i915_gem_request_skip(request);
+			dma_fence_set_error(&request->fence, fence->error);
+		}
+		if (i915_terminally_wedged(&request->i915->gpu_error))
+			wedged_submit_request(request);
+		else
+			request->engine->submit_request(request);
 		break;
 
 	case FENCE_FREE:
diff --git a/drivers/gpu/drm/i915/i915_gem_request.h b/drivers/gpu/drm/i915/i915_gem_request.h
index 5018e55922f0..6fdfb801bcee 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.h
+++ b/drivers/gpu/drm/i915/i915_gem_request.h
@@ -352,6 +352,8 @@ static inline bool i915_spin_request(const struct drm_i915_gem_request *request,
 		__i915_spin_request(request, seqno, state, timeout_us));
 }
 
+void i915_gem_request_skip(struct drm_i915_gem_request *request);
+
 /* We treat requests as fences. This is not be to confused with our
  * "fence registers" but pipeline synchronisation objects ala GL_ARB_sync.
  * We use the fences to synchronize access from the CPU with activity on the
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* ✗ Fi.CI.BAT: failure for series starting with [1/4] drm/i915: Stop allowing the device to be unwedged (rev2)
  2017-03-08 11:40 ` Chris Wilson
                   ` (4 preceding siblings ...)
  (?)
@ 2017-03-08 14:23 ` Patchwork
  2017-03-08 14:29   ` Chris Wilson
  -1 siblings, 1 reply; 12+ messages in thread
From: Patchwork @ 2017-03-08 14:23 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/4] drm/i915: Stop allowing the device to be unwedged (rev2)
URL   : https://patchwork.freedesktop.org/series/20896/
State : failure

== Summary ==

Series 20896v2 Series without cover letter
https://patchwork.freedesktop.org/api/1.0/series/20896/revisions/2/mbox/

Test drv_module_reload:
        Subgroup basic-reload:
                pass       -> DMESG-WARN (fi-ilk-650) fdo#100045
Test gem_exec_fence:
        Subgroup await-hang-default:
                pass       -> FAIL       (fi-kbl-7500u)
                pass       -> FAIL       (fi-skl-6260u)
                pass       -> FAIL       (fi-bdw-5557u)
                pass       -> FAIL       (fi-bxt-j4205)
                pass       -> FAIL       (fi-skl-6770hq)
                pass       -> FAIL       (fi-skl-6700k)
                pass       -> FAIL       (fi-bsw-n3050)
                pass       -> DMESG-WARN (fi-ilk-650)
                pass       -> FAIL       (fi-skl-6700hq)
                pass       -> FAIL       (fi-bxt-t5700)
        Subgroup nb-await-default:
                pass       -> SKIP       (fi-ilk-650)
Test gem_exec_flush:
        Subgroup basic-batch-kernel-default-uc:
                fail       -> PASS       (fi-snb-2600) fdo#100007
                pass       -> INCOMPLETE (fi-skl-6770hq)
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-batch-kernel-default-wb:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-uc-pro-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-uc-prw-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-uc-ro-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-uc-rw-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-uc-set-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-wb-pro-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-wb-prw-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-wb-ro-before-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-wb-ro-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-wb-rw-before-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-wb-rw-default:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-wb-set-default:
                pass       -> SKIP       (fi-ilk-650)
Test gem_exec_gttfill:
        Subgroup basic:
                pass       -> SKIP       (fi-ilk-650)
Test gem_exec_nop:
        Subgroup basic-parallel:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-series:
                pass       -> SKIP       (fi-ilk-650)
Test gem_exec_parallel:
        Subgroup basic:
                pass       -> SKIP       (fi-ilk-650)
Test gem_exec_reloc:
        Subgroup basic-cpu:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-cpu-active:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-cpu-gtt:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-cpu-gtt-active:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-cpu-gtt-noreloc:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-cpu-noreloc:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-cpu-read:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-cpu-read-active:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-cpu-read-noreloc:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-gtt:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-gtt-active:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-gtt-cpu:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-gtt-cpu-active:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-gtt-cpu-noreloc:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-gtt-noreloc:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-gtt-read:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-gtt-read-active:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-gtt-read-noreloc:
                pass       -> SKIP       (fi-ilk-650)
        Subgroup basic-softpin:
WARNING: Long output truncated

035f22af3e974f803031dc9f20e0969d5ae0a433 drm-tip: 2017y-03m-08d-11h-24m-39s UTC integration manifest
1504ed0 Revert "drm/i915: Stop allowing the device to be unwedged"
a462292 drm/i915: Skip cancelled requests in flight
ffe2f47 drm/i915: Track the error through the sw-fence
47e0183 drm/i915: Stop allowing the device to be unwedged

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4095/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ✗ Fi.CI.BAT: failure for series starting with [1/4] drm/i915: Stop allowing the device to be unwedged (rev2)
  2017-03-08 14:23 ` ✗ Fi.CI.BAT: failure for series starting with [1/4] drm/i915: Stop allowing the device to be unwedged (rev2) Patchwork
@ 2017-03-08 14:29   ` Chris Wilson
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Wilson @ 2017-03-08 14:29 UTC (permalink / raw)
  To: intel-gfx

On Wed, Mar 08, 2017 at 02:23:10PM -0000, Patchwork wrote:
> == Series Details ==
> 
> Series: series starting with [1/4] drm/i915: Stop allowing the device to be unwedged (rev2)
> URL   : https://patchwork.freedesktop.org/series/20896/
> State : failure
> 
> == Summary ==
> 
> Series 20896v2 Series without cover letter
> https://patchwork.freedesktop.org/api/1.0/series/20896/revisions/2/mbox/
> 
> Test gem_exec_fence:
>         Subgroup await-hang-default:
>                 pass       -> FAIL       (fi-kbl-7500u)
>                 pass       -> FAIL       (fi-skl-6260u)
>                 pass       -> FAIL       (fi-bdw-5557u)
>                 pass       -> FAIL       (fi-bxt-j4205)
>                 pass       -> FAIL       (fi-skl-6770hq)
>                 pass       -> FAIL       (fi-skl-6700k)
>                 pass       -> FAIL       (fi-bsw-n3050)
>                 pass       -> DMESG-WARN (fi-ilk-650)
>                 pass       -> FAIL       (fi-skl-6700hq)
>                 pass       -> FAIL       (fi-bxt-t5700)

Hmm. That'll be the error propagation then cancelling the dependent
requests :|

I wonder who's more correct here, the test or the patch.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] drm/i915: Skip cancelled requests in flight
  2017-03-08 12:14   ` [PATCH v2] " Chris Wilson
@ 2017-03-08 15:38     ` Chris Wilson
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Wilson @ 2017-03-08 15:38 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

On Wed, Mar 08, 2017 at 12:14:12PM +0000, Chris Wilson wrote:
>  static int __i915_sw_fence_call
>  submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
>  {
> @@ -486,7 +510,14 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
>  	switch (state) {
>  	case FENCE_COMPLETE:
>  		trace_i915_gem_request_submit(request);
> -		request->engine->submit_request(request);
> +		if (unlikely(fence->error == -EIO)) {

I'm not that keen on this error filtering. In the future, I expect
several critical errors (e.g. memory allocation failure whilst async
binding), but we also have non-critical ones like ETIMEDOUT or EAGAIN.

Maybe

if (unlikely(fence->error)) {
	if (critical_fence_error(fence->error))
		i915_gem_request_skip(request);
	dma_fence_set_error(&request->fence, fence->error);
}

static bool critical_fence_error(int err)
{
	switch (err) {
	default:
		MISSING_CASE(err);
	case -EIO:
	case -ENOMEM:
		return true;
	
	case -EAGAIN:
	case -ETIMEDOUT:
		return false;
	}
}

?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-03-08 15:39 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-08 11:40 [PATCH 1/4] drm/i915: Stop allowing the device to be unwedged Chris Wilson
2017-03-08 11:40 ` Chris Wilson
2017-03-08 11:40 ` [PATCH 2/4] drm/i915: Track the error through the sw-fence Chris Wilson
2017-03-08 11:40 ` [PATCH 3/4] drm/i915: Skip cancelled requests in flight Chris Wilson
2017-03-08 11:49   ` Chris Wilson
2017-03-08 12:14   ` [PATCH v2] " Chris Wilson
2017-03-08 15:38     ` Chris Wilson
2017-03-08 11:40 ` [PATCH 4/4] Revert "drm/i915: Stop allowing the device to be unwedged" Chris Wilson
2017-03-08 11:45 ` [PATCH 1/4] drm/i915: Stop allowing the device to be unwedged Chris Wilson
2017-03-08 11:45   ` Chris Wilson
2017-03-08 14:23 ` ✗ Fi.CI.BAT: failure for series starting with [1/4] drm/i915: Stop allowing the device to be unwedged (rev2) Patchwork
2017-03-08 14:29   ` Chris Wilson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.