* [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
From: Daniel Vetter @ 2012-06-26 21:08 UTC
  To: Intel Graphics Development; +Cc: Daniel Vetter

Having this early check in intel_ring_begin doesn't buy us anything,
since we'll be calling into wait_request in the usual case anyway. In
the corner case where we avoid waiting for free space by using the
last_retired_head, we simply need to do the same check, too.

With these changes we'll only ever get an -EIO from intel_ring_begin
if the gpu has truly been declared dead.

v2: Don't conflate the change to prevent intel_ring_begin from returning
a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
just prep work to avoid returning -EAGAIN in callsites that can't handle
syscall restarting.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/intel_ringbuffer.c |    4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index f30a53a..501546e 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1230,13 +1230,9 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
 int intel_ring_begin(struct intel_ring_buffer *ring,
 		     int num_dwords)
 {
-	struct drm_i915_private *dev_priv = ring->dev->dev_private;
 	int n = 4*num_dwords;
 	int ret;
 
-	if (unlikely(atomic_read(&dev_priv->mm.wedged)))
-		return -EIO;
-
 	if (unlikely(ring->tail + n > ring->effective_size)) {
 		ret = intel_wrap_ring_buffer(ring);
 		if (unlikely(ret))
-- 
1.7.10


* [PATCH 2/4] drm/i915: don't trylock in the gpu reset code
From: Daniel Vetter @ 2012-06-26 21:08 UTC
  To: Intel Graphics Development; +Cc: Daniel Vetter

Simply failing to reset the gpu because someone else might still hold
the mutex isn't a great idea - I see reliable silent reset failures.
And gpu reset simply needs to be reliable and Just Work.

"But ... the deadlocks!"

We already kick all processes waiting for the gpu before launching the
reset work item. New waiters need to check the wedging state anyway
and then bail out. If we have places that can deadlock, we simply need
to fix them.

"But ... testing!"

We have the gpu hangman, and if the current gpu load, gem_exec_nop,
isn't good enough to hit a specific case, we can add a new one.

"But ...  don't we return -EAGAIN for non-interruptible calls to
wait_seqno now?"

Yep, but this problem already exists in the current code. A follow up
patch will remedy this by returning -EIO for non-interruptible sleeps
if the gpu died and the low-level wait bails out with -EAGAIN.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_drv.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 79be879..0e114f7 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -866,8 +866,7 @@ int i915_reset(struct drm_device *dev)
 	if (!i915_try_reset)
 		return 0;
 
-	if (!mutex_trylock(&dev->struct_mutex))
-		return -EBUSY;
+	mutex_lock(&dev->struct_mutex);
 
 	i915_gem_reset(dev);
 
-- 
1.7.10


* [PATCH 3/4] drm/i915: non-interruptible sleeps can't handle -EAGAIN
From: Daniel Vetter @ 2012-06-26 21:08 UTC
  To: Intel Graphics Development; +Cc: Daniel Vetter

So don't return -EAGAIN, even in the case of a gpu hang. Remap it to -EIO
instead.

This is a bit ugly because intel_ring_begin is all non-interruptible
and hence only returns -EIO. But as the comment in there says,
auditing all the callsites would be a pain.

To avoid duplicating code, reuse i915_gem_check_wedge in __wait_seqno
and intel_wait_ring_buffer. Also use the opportunity to clarify the
different cases in i915_gem_check_wedge a bit with comments.

v2: Don't access dev_priv->mm.interruptible from check_wedge - we
might not hold dev->struct_mutex, making this racy. Instead pass
interruptible in as a parameter. I've noticed this because I've hit a
BUG_ON(!mutex_is_locked) at the top of check_wedge. This has been
added in

commit b4aca0106c466b5a0329318203f65bac2d91b682
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   Wed Apr 25 20:50:12 2012 -0700

    drm/i915: extract some common olr+wedge code

although that commit is missing any justification for it. I guess
it's just copy&paste, because the same commit adds the same BUG_ON
check to check_olr, where it indeed makes sense.

But in check_wedge everything we access is protected by other means,
so this is superfluous. And because it now gets in the way (we add a
new caller in __wait_seqno, which can be called without
dev->struct_mutex) let's just remove it.

v3: Group all the i915_gem_check_wedge refactoring into this patch, so
that this patch here is all about not returning -EAGAIN to callsites
that can't handle syscall restarting.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_drv.h         |    2 ++
 drivers/gpu/drm/i915/i915_gem.c         |   26 ++++++++++++++++++--------
 drivers/gpu/drm/i915/intel_ringbuffer.c |    6 ++++--
 3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a0c15ab..ab9ade0 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1330,6 +1330,8 @@ i915_gem_object_unpin_fence(struct drm_i915_gem_object *obj)
 
 void i915_gem_retire_requests(struct drm_device *dev);
 void i915_gem_retire_requests_ring(struct intel_ring_buffer *ring);
+int __must_check i915_gem_check_wedge(struct drm_i915_private *dev_priv,
+				      bool interruptible);
 
 void i915_gem_reset(struct drm_device *dev);
 void i915_gem_clflush_object(struct drm_i915_gem_object *obj);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6a98c06..af6a510 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1863,11 +1863,10 @@ i915_gem_retire_work_handler(struct work_struct *work)
 	mutex_unlock(&dev->struct_mutex);
 }
 
-static int
-i915_gem_check_wedge(struct drm_i915_private *dev_priv)
+int
+i915_gem_check_wedge(struct drm_i915_private *dev_priv,
+		     bool interruptible)
 {
-	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
-
 	if (atomic_read(&dev_priv->mm.wedged)) {
 		struct completion *x = &dev_priv->error_completion;
 		bool recovery_complete;
@@ -1878,7 +1877,16 @@ i915_gem_check_wedge(struct drm_i915_private *dev_priv)
 		recovery_complete = x->done > 0;
 		spin_unlock_irqrestore(&x->wait.lock, flags);
 
-		return recovery_complete ? -EIO : -EAGAIN;
+		/* Non-interruptible callers can't handle -EAGAIN, hence return
+		 * -EIO unconditionally for these. */
+		if (!interruptible)
+			return -EIO;
+
+		/* Recovery complete, but still wedged means reset failure. */
+		if (recovery_complete)
+			return -EIO;
+
+		return -EAGAIN;
 	}
 
 	return 0;
@@ -1932,6 +1940,7 @@ static int __wait_seqno(struct intel_ring_buffer *ring, u32 seqno,
 	unsigned long timeout_jiffies;
 	long end;
 	bool wait_forever = true;
+	int ret;
 
 	if (i915_seqno_passed(ring->get_seqno(ring), seqno))
 		return 0;
@@ -1963,8 +1972,9 @@ static int __wait_seqno(struct intel_ring_buffer *ring, u32 seqno,
 			end = wait_event_timeout(ring->irq_queue, EXIT_COND,
 						 timeout_jiffies);
 
-		if (atomic_read(&dev_priv->mm.wedged))
-			end = -EAGAIN;
+		ret = i915_gem_check_wedge(dev_priv, interruptible);
+		if (ret)
+			end = ret;
 	} while (end == 0 && wait_forever);
 
 	getrawmonotonic(&now);
@@ -2004,7 +2014,7 @@ i915_wait_seqno(struct intel_ring_buffer *ring, uint32_t seqno)
 
 	BUG_ON(seqno == 0);
 
-	ret = i915_gem_check_wedge(dev_priv);
+	ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 501546e..6c024d4 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1220,8 +1220,10 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
 		}
 
 		msleep(1);
-		if (atomic_read(&dev_priv->mm.wedged))
-			return -EAGAIN;
+
+		ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible);
+		if (ret)
+			return ret;
 	} while (!time_after(jiffies, end));
 	trace_i915_ring_wait_end(ring);
 	return -EBUSY;
-- 
1.7.10


* [PATCH 4/4] drm/i915: don't hang userspace when the gpu reset is stuck
From: Daniel Vetter @ 2012-06-26 21:08 UTC
  To: Intel Graphics Development; +Cc: Daniel Vetter

With the gpu reset no longer using a trylock we've increased the chances
of userspace getting stuck quite a bit. To make that (hopefully) rare
case more palatable, time out when waiting for the gpu reset code to
complete and signal this little issue to the caller by returning -EIO.

This should help userspace to somewhat gracefully fall back and
hopefully allow the user to grab some logs and reboot the machine
(instead of staring at a frozen X screen in agony).

Suggested by Chris Wilson because I've been stubborn about allowing
the gpu reset code to fail.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_gem.c |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index af6a510..7d28555 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -96,9 +96,18 @@ i915_gem_wait_for_error(struct drm_device *dev)
 	if (!atomic_read(&dev_priv->mm.wedged))
 		return 0;
 
-	ret = wait_for_completion_interruptible(x);
-	if (ret)
+	/*
+	 * Only wait 10 seconds for the gpu reset to complete to avoid hanging
+	 * userspace. If it takes that long something really bad is going on and
+	 * we should simply try to bail out and fail as gracefully as possible.
+	 */
+	ret = wait_for_completion_interruptible_timeout(x, 10*HZ);
+	if (ret == 0) {
+		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
+		return -EIO;
+	} else if (ret < 0) {
 		return ret;
+	}
 
 	if (atomic_read(&dev_priv->mm.wedged)) {
 		/* GPU is hung, bump the completion count to account for
-- 
1.7.10


* Re: [PATCH 2/4] drm/i915: don't trylock in the gpu reset code
From: Łukasz Kuryło @ 2012-06-26 22:08 UTC
  To: Daniel Vetter; +Cc: Intel Graphics Development

>
> -       if (!mutex_trylock(&dev->struct_mutex))
> -               return -EBUSY;
> +       mutex_lock(&dev->struct_mutex);
>
>        i915_gem_reset(dev);
>

But ... the original code:

Correct me if I'm wrong.
In every manual I've found, mutex_trylock(...) returns 0 on success.
So

if(!mutex_trylock(&dev->struct_mutex))
   return -EBUSY;

would actually execute the return when the mutex was acquired as requested.
We would then return without releasing it and possibly end up with a
deadlock.

On the other hand, if the mutex was already locked, by another or maybe
even the same thread, mutex_trylock would return an error (some nonzero
value). Nonzero means true ... if(!true) -> if(false) means the "if"
body is not executed. In this case, in spite of not getting the mutex,
we would proceed.

You've written:
"Simply failing to reset the gpu because someone else might still hold
the mutex isn't a great idea - I see reliable silent reset failures.
And gpu reset simply needs to be reliable and Just Work."

I believe that was not the case with the original code.
From my understanding, this procedure failed to reset the gpu when the
mutex was unlocked (locking it and bailing out) and tried to reset it
if the mutex could not be acquired.


* Re: [PATCH 2/4] drm/i915: don't trylock in the gpu reset code
From: Daniel Vetter @ 2012-06-26 22:34 UTC
  To: Łukasz Kuryło; +Cc: Daniel Vetter, Intel Graphics Development

On Wed, Jun 27, 2012 at 12:08:33AM +0200, Łukasz Kuryło wrote:
> >
> > -       if (!mutex_trylock(&dev->struct_mutex))
> > -               return -EBUSY;
> > +       mutex_lock(&dev->struct_mutex);
> >
> >        i915_gem_reset(dev);
> >
> 
> But ... the original code:
> 
> Correct me if I'm wrong.
> In every manual I've found mutex_trylock(...) returns 0 on success.
> So

Quoting the kernel doc for mutex_trylock:

"Try to acquire the mutex atomically. Returns 1 if the mutex has been
acquired successfully, and 0 on contention."

Also, it's open-source, so you could double-check the implementation ...
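
For illustration, a minimal usage sketch (hypothetical, not taken from the
patch) - note that the sense is inverted compared to pthread_mutex_trylock(),
which returns 0 on success:

	if (!mutex_trylock(&dev->struct_mutex))	/* 0 means contention */
		return -EBUSY;			/* lock NOT taken, bail out */
	/* ... critical section, struct_mutex held ... */
	mutex_unlock(&dev->struct_mutex);

So the old code bailed out with -EBUSY exactly when it failed to get the
mutex, and proceeded with the reset (holding the mutex) otherwise.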

-Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48


* Re: [PATCH 3/4] drm/i915: non-interruptible sleeps can't handle -EAGAIN
From: Ben Widawsky @ 2012-06-27 15:19 UTC
  To: Daniel Vetter; +Cc: Intel Graphics Development

On Tue, 26 Jun 2012 23:08:52 +0200
Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> So don't return -EAGAIN, even in the case of a gpu hang. Remap it to -EIO
> instead.

What I'd really like to see in this rather long commit message is what
exactly happens in this case that's being fixed (maybe I should know,
but I don't).

> 
> This is a bit ugly because intel_ring_begin is all non-interruptible
> and hence only returns -EIO. But as the comment in there says,
> auditing all the callsites would be a pain.
> 
> To avoid duplicating code, reuse i915_gem_check_wedge in __wait_seqno
> and intel_wait_ring_buffer. Also use the opportunity to clarify the
> different cases in i915_gem_check_wedge a bit with comments.
> 
> v2: Don't access dev_priv->mm.interruptible from check_wedge - we
> might not hold dev->struct_mutex, making this racy. Instead pass
> interruptible in as a parameter. I've noticed this because I've hit a
> BUG_ON(!mutex_is_locked) at the top of check_wedge. This has been
> added in
> 
> commit b4aca0106c466b5a0329318203f65bac2d91b682
> Author: Ben Widawsky <ben@bwidawsk.net>
> Date:   Wed Apr 25 20:50:12 2012 -0700
> 
>     drm/i915: extract some common olr+wedge code
> 
> although that commit is missing any justification for it. I guess
> it's just copy&paste, because the same commit adds the same BUG_ON
> check to check_olr, where it indeed makes sense.
> 
> But in check_wedge everything we access is protected by other means,
> so this is superfluous. And because it now gets in the way (we add a
> new caller in __wait_seqno, which can be called without
> dev->struct_mutex) let's just remove it.
> 
> v3: Group all the i915_gem_check_wedge refactoring into this patch, so
> that this patch here is all about not returning -EAGAIN to callsites
> that can't handle syscall restarting.
> 
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Reviewed-by: Ben Widawsky <ben@bwidawsk.net>

[quoted patch trimmed]



-- 
Ben Widawsky, Intel Open Source Technology Center


* Re: [PATCH 3/4] drm/i915: non-interruptible sleeps can't handle -EAGAIN
From: Daniel Vetter @ 2012-06-27 16:36 UTC
  To: Ben Widawsky; +Cc: Daniel Vetter, Intel Graphics Development

On Wed, Jun 27, 2012 at 08:19:03AM -0700, Ben Widawsky wrote:
> On Tue, 26 Jun 2012 23:08:52 +0200
> Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> 
> > So don't return -EAGAIN, even in the case of a gpu hang. Remap it to -EIO
> > instead.
> 
> What I'd really like to see in this rather long commit message is what
> exactly happens in this case that's being fixed (maybe I should know,
> but I don't).

What about adding: "Note that this isn't really an issue with
interruptibility, but more that we have quite a few codepaths (mostly
around kms stuff) that simply can't handle any errors and hence not even
-EAGAIN. Instead of adding proper failure paths so that we could restart
these ioctls, we've opted for the cheap way out of sleeping
non-interruptibly. Which works everywhere but when the gpu dies, which
this patch fixes.

So essentially interruptible == false means 'wait for the gpu or die
trying'."

-Daniel


-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48


* Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
From: Ben Widawsky @ 2012-07-01  3:09 UTC
  To: intel-gfx

On Tue, 26 Jun 2012 23:08:50 +0200
Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> Having this early check in intel_ring_begin doesn't buy us anything,
> since we'll be calling into wait_request in the usual case already
> anyway. In the corner case of not waiting for free space using the
> last_retired_head we simply need to do the same check, too.
> 
> With these changes we'll only ever get an -EIO from intel_ring_begin
> if the gpu has truly been declared dead.
> 
> v2: Don't conflate the change to prevent intel_ring_begin from returning
> a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
> just prep work to avoid returning -EAGAIN in callsites that can't handle
> syscall restarting.

I'm really scared by this change. It's difficult to review because there
are SO many callers of intel_ring_begin, and figuring out if they all
work in the wedged case is simply too difficult. I've yet to review the
rest of the series, but it may make more sense to put this change last
perhaps?




-- 
Ben Widawsky, Intel Open Source Technology Center


* Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
From: Daniel Vetter @ 2012-07-01 10:41 UTC
  To: Ben Widawsky; +Cc: intel-gfx

On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky <ben@bwidawsk.net> wrote:
> On Tue, 26 Jun 2012 23:08:50 +0200
> Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
>> Having this early check in intel_ring_begin doesn't buy us anything,
>> since we'll be calling into wait_request in the usual case already
>> anyway. In the corner case of not waiting for free space using the
>> last_retired_head we simply need to do the same check, too.
>>
>> With these changes we'll only ever get an -EIO from intel_ring_begin
>> if the gpu has truly been declared dead.
>>
>> v2: Don't conflate the change to prevent intel_ring_begin from returning
>> a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
>> just prep work to avoid returning -EAGAIN in callsites that can't handle
>> syscall restarting.
>
> > I'm really scared by this change. It's difficult to review because there
> are SO many callers of intel_ring_begin, and figuring out if they all
> work in the wedged case is simply too difficult. I've yet to review the
> rest of the series, but it may make more sense to put this change last
> perhaps?

Well, this patch doesn't really change much about which error codes the
callers get - we'll still throw both -EAGAIN and -EIO at them (later
patches will fix this).

What this patch does though is prevent us from returning too many
-EIO. Imagine the gpu died and we've noticed already (hence
dev_priv->mm.wedged is set), but some process is stuck churning
through the execbuf ioctl, holding dev->struct_mutex. While processing
the execbuf we call intel_ring_begin to emit a few commands. Now
usually, even when the gpu is dead, there is enough space in the ring
to do so, which allows us to complete the execbuf ioctl. Later on we
can then block properly when trying to grab the mutex in the next
ioctl, waiting for the gpu reset work handler to complete.

But thanks to that wedged check in intel_ring_begin we'll instead
return -EIO, despite the fact that later on we could successfully
reset the gpu. Returning -EIO forces the X server to fall back to s/w
rendering and disable dri2, and in the case of a 3d compositor it
usually results in an abort. After this patch we can still return -EIO
if the gpu is dead but the reset work hasn't completed yet, but only
if the ring is full (which in many cases is unlikely).

Cheers, Daniel




-- 
Daniel Vetter
daniel.vetter@ffwll.ch - +41 (0) 79 364 57 48 - http://blog.ffwll.ch


* Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
From: Ben Widawsky @ 2012-07-02 16:04 UTC
  To: Daniel Vetter; +Cc: intel-gfx

On Sun, 1 Jul 2012 12:41:19 +0200
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky <ben@bwidawsk.net> wrote:
> > On Tue, 26 Jun 2012 23:08:50 +0200
> > Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> >> Having this early check in intel_ring_begin doesn't buy us anything,
> >> since we'll be calling into wait_request in the usual case already
> >> anyway. In the corner case of not waiting for free space using the
> >> last_retired_head we simply need to do the same check, too.
> >>
> >> With these changes we'll only ever get an -EIO from intel_ring_begin
> >> if the gpu has truly been declared dead.
> >>
> >> v2: Don't conflate the change to prevent intel_ring_begin from returning
> >> a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
> >> just prep work to avoid returning -EAGAIN in callsites that can't handle
> >> syscall restarting.
> >
> > > I'm really scared by this change. It's difficult to review because there
> > are SO many callers of intel_ring_begin, and figuring out if they all
> > work in the wedged case is simply too difficult. I've yet to review the
> > rest of the series, but it may make more sense to put this change last
> > perhaps?
> 
> Well, this patch doesn't really change much about which error codes the
> callers get - we'll still throw both -EAGAIN and -EIO at them (later
> patches will fix this).
> 
> What this patch does though is prevent us from returning too many
> -EIO. Imagine the gpu died and we've noticed already (hence
> dev_priv->mm.wedged is set), but some process is stuck churning
> through the execbuf ioctl, holding dev->struct_mutex. While processing
> the execbuf we call intel_ring_begin to emit a few commands. Now
> usually, even when the gpu is dead, there is enough space in the ring
> to do so, which allows us to complete the execbuf ioctl. Later on we
> can then block properly when trying to grab the mutex in the next
> ioctl, waiting for the gpu reset work handler to complete.

That in itself is a pretty big change, don't you think? It seems rather
strange and dangerous to modify HW (which we will do if we allow execbuf
to continue and write the tail pointer). I think we want some way to
not write the tail ptr in such a case. Now you might respond, well, who
cares about writes? But this imposes a pretty large restriction on any
code that can't work well after the GPU is hung.

I see the irony. I'm complaining that you can make GPU hangs unstable,
and the patch series is fixing GPU reset. Call it paranoia, it still
seems unsafe to me, and makes us have to think a bit more whenever
adding any code. Am I over-thinking this?

> 
> But thanks to that wedged check in intel_ring_begin we'll instead
> return -EIO, despite the fact that later on we could successfully
> reset the gpu. Returning -EIO forces the X server to fall back to s/w
> rendering and disabling dri2, and in the case of a 3d compositor
> usually results in a abort. After this patch we can still return -EIO
> if the gpu is dead but the reset work hasn't completed yet, but only
> so if the ring is full (which in many cases is unlikely).
> 
> Cheers, Daniel
> 
> >
> >>
> >> Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> >> ---
> >>  drivers/gpu/drm/i915/intel_ringbuffer.c |    4 ----
> >>  1 file changed, 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> >> index f30a53a..501546e 100644
> >> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> >> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> >> @@ -1230,13 +1230,9 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
> >>  int intel_ring_begin(struct intel_ring_buffer *ring,
> >>                    int num_dwords)
> >>  {
> >> -     struct drm_i915_private *dev_priv = ring->dev->dev_private;
> >>       int n = 4*num_dwords;
> >>       int ret;
> >>
> >> -     if (unlikely(atomic_read(&dev_priv->mm.wedged)))
> >> -             return -EIO;
> >> -
> >>       if (unlikely(ring->tail + n > ring->effective_size)) {
> >>               ret = intel_wrap_ring_buffer(ring);
> >>               if (unlikely(ret))
> >
> >
> >
> > --
> > Ben Widawsky, Intel Open Source Technology Center
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx@lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
> 
> 



-- 
Ben Widawsky, Intel Open Source Technology Center


* Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
From: Daniel Vetter @ 2012-07-02 16:47 UTC
  To: Ben Widawsky; +Cc: intel-gfx

On Mon, Jul 02, 2012 at 09:04:48AM -0700, Ben Widawsky wrote:
> 
> That in itself is a pretty big change, don't you think? It seems rather
> strange and dangerous to modify HW (which we will if we allow execbuf to
> continue when we write the tail pointer). I think we want some way to
> not write the tail ptr in such a case. Now you might respond, well, who
> cares about writes? But this imposes a pretty large restriction on any
> code that can't work well after the GPU is hung.
> 
> I see the irony. I'm complaining that you can make GPU hangs unstable,
> and the patch series is fixing GPU reset. Call it paranoia, it still
> seems unsafe to me, and makes us have to think a bit more whenever
> adding any code. Am I over-thinking this?

I think so. It takes us a few seconds before we notice that the gpu died -
only when the hangcheck timer has expired at least twice do we declare the
hw wedged. Imo a few seconds is plenty of time to sneak in a few ringbuffer
emissions (and hence tail pointer writes) ;-)

So I think trying to avoid touching the gpu for another few usecs _is_
overthinking the problem, because fundamentally we can't avoid touching a
dead gpu - we will only ever notice its demise after the fact.

Yours, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48


* Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
From: Chris Wilson @ 2012-07-03 15:59 UTC
  To: Intel Graphics Development; +Cc: Daniel Vetter

My experience with these patches is that they make it less likely that
the hang is reported to the userspace in a timely fashion (filling the
ring full leads to lots of lost rendering) and worse make it much more
likely that i915_gem_fault() hits an EIO and goes bang. That is
unacceptable and trivial to hit with these patches. I have not yet
reproduced that issue using the same broken renderer without these
patches.

I do think the patches are a step in the right direction, but with the
change in userspace behaviour it has to be a NAK for the time being.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
From: Daniel Vetter @ 2012-07-03 18:11 UTC
  To: Chris Wilson; +Cc: Intel Graphics Development

On Tue, Jul 3, 2012 at 5:59 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> My experience with these patches is that they make it less likely that
> the hang is reported to the userspace in a timely fashion (filling the
> ring full leads to lots of lost rendering) and worse make it much more
> likely that i915_gem_fault() hits an EIO and goes bang. That is
> unacceptable and trivial to hit with these patches. I have not yet
> reproduced that issue using the same broken renderer without these
> patches.

Hm, I don't see how these patches allow much more rendering to be
queued up until we stop everything - for single-threaded userspace
that should be at most one additional batch (which might have been the
one that could catch the spurious -EIO). All subsequent execbuf ioctl
calls should stall when trying to grab the mutex.

Same for the case where the gpu reset failed: userspace should be able
to submit one more batch afaict until it gets an -EIO. So can you
please dig into what exactly you're seeing a bit more and unconfuse me?

> I do think the patches are a step in the right direction, but with the
> change in userspace behaviour it has to be a NAK for the time being.

Ok, I guess I'll have to throw the -EIO sigbus eater into the mix,
too. After all, userspace is supposed to call set_domain(GTT) before
accessing the gtt mmap, so it should still get notified when the gpu has
died and rendering might be incomplete.

Imo we should still return -EIO for all ioctls; userspace should be
able to cope with these (in the future we might even add optional
behaviour to signal potentially dropped rendering due to a gpu reset
at wait_rendering, for userspace that really cares). So would the
sigbus eater be good enough, or do we need more?
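
One possible shape for that sigbus eater (a rough sketch only, the exact
error mapping is up for discussion): teach the error path of
i915_gem_fault to treat -EIO like -EAGAIN, i.e.

	switch (ret) {
	case -EIO:	/* gpu wedged: eat it, refault instead of SIGBUS */
	case -EAGAIN:	/* gpu reset in progress, retry the fault */
	case 0:
		return VM_FAULT_NOPAGE;
	case -ENOMEM:
		return VM_FAULT_OOM;
	default:
		return VM_FAULT_SIGBUS;
	}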

Thanks, Daniel
-- 
Daniel Vetter
daniel.vetter@ffwll.ch - +41 (0) 79 365 57 48 - http://blog.ffwll.ch

