* [PATCH 0/5] reset rework, 2nd try
@ 2012-07-04 20:18 Daniel Vetter
  2012-07-04 20:18 ` [PATCH 1/5] drm/i915: don't trylock in the gpu reset code Daniel Vetter
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Daniel Vetter @ 2012-07-04 20:18 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

Hi all,

It took me a while to see the real issues Chris has been complaining about, but I
think the reworked patches take them all into account now. The big change is
that the wedged check at the beginning of intel_ring_begin stays, but gets
improved by properly deciding between -EIO and -EAGAIN. I've checked all the
callsites and couldn't find any issues.

Comments, flames, reviews and testing reports highly welcome.

I've also pushed the latest version of this to the reset-rework branch of my
personal git repo.

Cheers, Daniel

Daniel Vetter (5):
  drm/i915: don't trylock in the gpu reset code
  drm/i915: non-interruptible sleeps can't handle -EGAIN
  drm/i915: don't hang userspace when the gpu reset is stuck
  drm/i915: properly SIGBUS on I/O errors
  drm/i915: don't return a spurious -EIO from intel_ring_begin

 drivers/gpu/drm/i915/i915_drv.c         |    3 +--
 drivers/gpu/drm/i915/i915_drv.h         |    2 ++
 drivers/gpu/drm/i915/i915_gem.c         |   44 ++++++++++++++++++++++++-------
 drivers/gpu/drm/i915/intel_ringbuffer.c |   24 ++++++-----------
 4 files changed, 45 insertions(+), 28 deletions(-)

-- 
1.7.10

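As a rough sketch of the error-handling contract the series aims for, here is a
purely illustrative pseudo-callsite of intel_ring_begin (it assumes the
driver-internal intel_ringbuffer.h declarations and is not an actual caller in
the tree; only intel_ring_begin and the two error codes are real):

/*
 * Illustrative only: how a callsite is expected to treat the two error
 * codes after this rework.
 *   -EAGAIN: the reset handler is still running and the caller is
 *            interruptible; let syscall restarting retry the ioctl.
 *   -EIO:    the gpu is wedged for good (or the caller is
 *            non-interruptible); report the failure.
 */
static int example_emit(struct intel_ring_buffer *ring)
{
	int ret;

	ret = intel_ring_begin(ring, 4);
	if (ret == -EAGAIN)
		return ret;	/* bubbles up so the ioctl gets restarted */
	if (ret)
		return ret;	/* -EIO and friends: genuine failure */

	/* ... emit commands and advance the ring here ... */
	return 0;
}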

* [PATCH 1/5] drm/i915: don't trylock in the gpu reset code
  2012-07-04 20:18 [PATCH 0/5] reset rework, 2nd try Daniel Vetter
@ 2012-07-04 20:18 ` Daniel Vetter
  2012-07-04 20:18 ` [PATCH 2/5] drm/i915: non-interruptible sleeps can't handle -EGAIN Daniel Vetter
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Daniel Vetter @ 2012-07-04 20:18 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

Simply failing to reset the gpu because someone else might still hold
the mutex isn't a great idea - I see reliable silent reset failures.
And gpu reset simply needs to be reliable and Just Work.

"But ... the deadlocks!"

We already kick all processes waiting for the gpu before launching the
reset work item. New waiters need to check the wedging state anyway
and then bail out. If we have places that can deadlock, we simply need
to fix them.

"But ... testing!"

We have the gpu hangman, and if the current gpu load, gem_exec_nop,
isn't good enough to hit a specific case, we can add a new one.

"But ...  don't we return -EAGAIN for non-interruptible calls to
wait_seqno now?"

Yep, but this problem already exists in the current code. A follow up
patch will remedy this by returning -EIO for non-interruptible sleeps
if the gpu died and the low-level wait bails out with -EAGAIN.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_drv.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 6edb2d5..e754cdf 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -730,8 +730,7 @@ int i915_reset(struct drm_device *dev)
 	if (!i915_try_reset)
 		return 0;
 
-	if (!mutex_trylock(&dev->struct_mutex))
-		return -EBUSY;
+	mutex_lock(&dev->struct_mutex);
 
 	i915_gem_reset(dev);
 
-- 
1.7.10

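For reference, a minimal standalone illustration of the pattern this commit
message relies on (generic kernel primitives only, deliberately not the i915
code itself): publish the error state, wake every sleeper so it can notice it
and bail out, and only then take the lock blocking instead of trylocking.

#include <linux/atomic.h>
#include <linux/mutex.h>
#include <linux/wait.h>

static DEFINE_MUTEX(big_lock);
static DECLARE_WAIT_QUEUE_HEAD(waiters);
static atomic_t wedged = ATOMIC_INIT(0);

static void error_handler(void)
{
	atomic_set(&wedged, 1);		/* publish the error state */
	wake_up_all(&waiters);		/* kick everyone sleeping for the hw */
	/*
	 * Safe to sleep on the lock now: woken waiters re-check 'wedged',
	 * drop big_lock and return -EAGAIN/-EIO instead of waiting again.
	 */
	mutex_lock(&big_lock);
	/* ... reset the hw ... */
	atomic_set(&wedged, 0);
	mutex_unlock(&big_lock);
}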

* [PATCH 2/5] drm/i915: non-interruptible sleeps can't handle -EGAIN
  2012-07-04 20:18 [PATCH 0/5] reset rework, 2nd try Daniel Vetter
  2012-07-04 20:18 ` [PATCH 1/5] drm/i915: don't trylock in the gpu reset code Daniel Vetter
@ 2012-07-04 20:18 ` Daniel Vetter
  2012-07-04 20:54   ` [PATCH] drm/i915: non-interruptible sleeps can't handle -EAGAIN Daniel Vetter
  2012-07-04 20:18 ` [PATCH 3/5] drm/i915: don't hang userspace when the gpu reset is stuck Daniel Vetter
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Daniel Vetter @ 2012-07-04 20:18 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

So don't return -EAGAIN, even in the case of a gpu hang. Remap it to
-EIO instead. Note that this isn't really an issue with
interruptibility, but more that we have quite a few codepaths (mostly
around kms stuff) that simply can't handle any errors and hence not
even -EGAIN. Instead of adding proper failure paths so that we could
restart these ioctls we've opted for the cheap way out of sleeping
non-interruptibly.  Which works everywhere but when the gpu dies,
which this patch fixes.

So essentially interruptible == false means 'wait for the gpu or die
trying'.

This patch is a bit ugly because intel_ring_begin is all non-interruptible
and hence only returns -EIO. But as the comment in there says,
auditing all the callsites would be a pain.

To avoid duplicating code, reuse i915_gem_check_wedge in __wait_seqno
and intel_wait_ring_buffer. Also use the opportunity to clarify the
different cases in i915_gem_check_wedge a bit with comments.

v2: Don't access dev_priv->mm.interruptible from check_wedge - we
might not hold dev->struct_mutex, making this racy. Instead pass
interruptible in as a parameter. I've noticed this because I've hit a
BUG_ON(!mutex_is_locked) at the top of check_wedge. This has been
added in

commit b4aca0106c466b5a0329318203f65bac2d91b682
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   Wed Apr 25 20:50:12 2012 -0700

    drm/i915: extract some common olr+wedge code

although that commit is missing any justification for this. I guess
it's just copy&paste, because the same commit adds the same BUG_ON
check to check_olr, where it indeed makes sense.

But in check_wedge everything we access is protected by other means,
so this is superfluous. And because it now gets in the way (we add a
new caller in __wait_seqno, which can be called without
dev->struct_mutex), let's just remove it.

v3: Group all the i915_gem_check_wedge refactoring into this patch, so
that this patch here is all about not returning -EAGAIN to callsites
that can't handle syscall restarting.

v4: Add clarification of what interruptible == false means in our code,
requested by Ben Widawsky.

Reviewed-by: Ben Widawsky <ben@bwidawsk.net>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_drv.h         |    2 ++
 drivers/gpu/drm/i915/i915_gem.c         |   26 ++++++++++++++++++--------
 drivers/gpu/drm/i915/intel_ringbuffer.c |    6 ++++--
 3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d839e4c..6c3a0bb 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1336,6 +1336,8 @@ i915_gem_object_unpin_fence(struct drm_i915_gem_object *obj)
 
 void i915_gem_retire_requests(struct drm_device *dev);
 void i915_gem_retire_requests_ring(struct intel_ring_buffer *ring);
+int __must_check i915_gem_check_wedge(struct drm_i915_private *dev_priv,
+				      bool interruptible);
 
 void i915_gem_reset(struct drm_device *dev);
 void i915_gem_clflush_object(struct drm_i915_gem_object *obj);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6a98c06..af6a510 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1863,11 +1863,10 @@ i915_gem_retire_work_handler(struct work_struct *work)
 	mutex_unlock(&dev->struct_mutex);
 }
 
-static int
-i915_gem_check_wedge(struct drm_i915_private *dev_priv)
+int
+i915_gem_check_wedge(struct drm_i915_private *dev_priv,
+		     bool interruptible)
 {
-	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
-
 	if (atomic_read(&dev_priv->mm.wedged)) {
 		struct completion *x = &dev_priv->error_completion;
 		bool recovery_complete;
@@ -1878,7 +1877,16 @@ i915_gem_check_wedge(struct drm_i915_private *dev_priv)
 		recovery_complete = x->done > 0;
 		spin_unlock_irqrestore(&x->wait.lock, flags);
 
-		return recovery_complete ? -EIO : -EAGAIN;
+		/* Non-interruptible callers can't handle -EAGAIN, hence return
+		 * -EIO unconditionally for these. */
+		if (!interruptible)
+			return -EIO;
+
+		/* Recovery complete, but still wedged means reset failure. */
+		if (recovery_complete)
+			return -EIO;
+
+		return -EAGAIN;
 	}
 
 	return 0;
@@ -1932,6 +1940,7 @@ static int __wait_seqno(struct intel_ring_buffer *ring, u32 seqno,
 	unsigned long timeout_jiffies;
 	long end;
 	bool wait_forever = true;
+	int ret;
 
 	if (i915_seqno_passed(ring->get_seqno(ring), seqno))
 		return 0;
@@ -1963,8 +1972,9 @@ static int __wait_seqno(struct intel_ring_buffer *ring, u32 seqno,
 			end = wait_event_timeout(ring->irq_queue, EXIT_COND,
 						 timeout_jiffies);
 
-		if (atomic_read(&dev_priv->mm.wedged))
-			end = -EAGAIN;
+		ret = i915_gem_check_wedge(dev_priv, interruptible);
+		if (ret)
+			end = ret;
 	} while (end == 0 && wait_forever);
 
 	getrawmonotonic(&now);
@@ -2004,7 +2014,7 @@ i915_wait_seqno(struct intel_ring_buffer *ring, uint32_t seqno)
 
 	BUG_ON(seqno == 0);
 
-	ret = i915_gem_check_wedge(dev_priv);
+	ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index dce4d1a..cd35ad4 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1228,8 +1228,10 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
 		}
 
 		msleep(1);
-		if (atomic_read(&dev_priv->mm.wedged))
-			return -EAGAIN;
+
+		ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible);
+		if (ret)
+			return ret;
 	} while (!time_after(jiffies, end));
 	trace_i915_ring_wait_end(ring);
 	return -EBUSY;
-- 
1.7.10


* [PATCH 3/5] drm/i915: don't hang userspace when the gpu reset is stuck
  2012-07-04 20:18 [PATCH 0/5] reset rework, 2nd try Daniel Vetter
  2012-07-04 20:18 ` [PATCH 1/5] drm/i915: don't trylock in the gpu reset code Daniel Vetter
  2012-07-04 20:18 ` [PATCH 2/5] drm/i915: non-interruptible sleeps can't handle -EGAIN Daniel Vetter
@ 2012-07-04 20:18 ` Daniel Vetter
  2012-07-04 20:18 ` [PATCH 4/5] drm/i915: properly SIGBUS on I/O errors Daniel Vetter
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Daniel Vetter @ 2012-07-04 20:18 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

With the gpu reset no longer using a trylock we've increased the
chances of userspace getting stuck quite a bit. To make that
(hopefully) rare case more palatable, time out when waiting for the gpu
reset code to complete and signal this little issue to the caller by
returning -EIO.

This should help userspace to somewhat gracefully fall back and
hopefully allow the user to grab some logs and reboot the machine
(instead of staring at a frozen X screen in agony).

Suggested by Chris Wilson because I've been stubborn about not allowing
the gpu reset code to fail, ever (by removing the trylock).

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_gem.c |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index af6a510..7d28555 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -96,9 +96,18 @@ i915_gem_wait_for_error(struct drm_device *dev)
 	if (!atomic_read(&dev_priv->mm.wedged))
 		return 0;
 
-	ret = wait_for_completion_interruptible(x);
-	if (ret)
+	/*
+	 * Only wait 10 seconds for the gpu reset to complete to avoid hanging
+	 * userspace. If it takes that long something really bad is going on and
+	 * we should simply try to bail out and fail as gracefully as possible.
+	 */
+	ret = wait_for_completion_interruptible_timeout(x, 10*HZ);
+	if (ret == 0) {
+		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
+		return -EIO;
+	} else if (ret < 0) {
 		return ret;
+	}
 
 	if (atomic_read(&dev_priv->mm.wedged)) {
 		/* GPU is hung, bump the completion count to account for
-- 
1.7.10

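The patch leans on the return-value convention of
wait_for_completion_interruptible_timeout(); written out on its own as a
hedged sketch (not the driver's actual function), the three cases are:

#include <linux/completion.h>
#include <linux/errno.h>
#include <linux/jiffies.h>

static int wait_for_reset_bounded(struct completion *x)
{
	long ret = wait_for_completion_interruptible_timeout(x, 10 * HZ);

	if (ret == 0)		/* timed out: the reset work is stuck */
		return -EIO;
	if (ret < 0)		/* interrupted by a signal (-ERESTARTSYS) */
		return ret;
	return 0;		/* completed, with 'ret' jiffies to spare */
}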

* [PATCH 4/5] drm/i915: properly SIGBUS on I/O errors
  2012-07-04 20:18 [PATCH 0/5] reset rework, 2nd try Daniel Vetter
                   ` (2 preceding siblings ...)
  2012-07-04 20:18 ` [PATCH 3/5] drm/i915: don't hang userspace when the gpu reset is stuck Daniel Vetter
@ 2012-07-04 20:18 ` Daniel Vetter
  2012-07-04 20:40   ` Daniel Vetter
  2012-07-04 20:18 ` [PATCH 5/5] drm/i915: don't return a spurious -EIO from intel_ring_begin Daniel Vetter
  2012-07-04 20:54 ` [PATCH 0/5] reset rework, 2nd try Chris Wilson
  5 siblings, 1 reply; 11+ messages in thread
From: Daniel Vetter @ 2012-07-04 20:18 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

... instead of looping endlessly with no hope of ever serving that
page-fault. We only need to break out of this loop when the gpu has died,
to run the reset work (and hopefully resurrect it).

This seems to have been lost in:

commit d9bc7e9f32716901c617e1f0fb6ce0f74f172686
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Feb 7 13:09:31 2011 +0000

    drm/i915: Fix infinite loop regression from 21dd3734

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_gem.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 7d28555..2b54142 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1141,6 +1141,11 @@ unlock:
 out:
 	switch (ret) {
 	case -EIO:
+		/* If this -EIO is due to a gpu hang, give the reset code a
+		 * chance to clean up the mess. Otherwise return the proper
+		 * SIGBUS. */
+		if (!atomic_read(&dev_priv->mm.wedged))
+			return VM_FAULT_SIGBUS;
 	case -EAGAIN:
 		/* Give the error handler a chance to run and move the
 		 * objects off the GPU active list. Next time we service the
-- 
1.7.10


* [PATCH 5/5] drm/i915: don't return a spurious -EIO from intel_ring_begin
  2012-07-04 20:18 [PATCH 0/5] reset rework, 2nd try Daniel Vetter
                   ` (3 preceding siblings ...)
  2012-07-04 20:18 ` [PATCH 4/5] drm/i915: properly SIGBUS on I/O errors Daniel Vetter
@ 2012-07-04 20:18 ` Daniel Vetter
  2012-07-04 20:52   ` [PATCH] " Daniel Vetter
  2012-07-04 20:54 ` [PATCH 0/5] reset rework, 2nd try Chris Wilson
  5 siblings, 1 reply; 11+ messages in thread
From: Daniel Vetter @ 2012-07-04 20:18 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

The issue with this check is that it results in userspace receiving an
-EIO while the gpu reset hasn't completed, resulting in fallback to sw
rendering or worse.

Now there's also a stern comment in intel_ring_wait_seqno saying that
intel_ring_begin should not return -EAGAIN, ever, because some callers
can't handle that. But after an audit of the callsites I don't see any
issues. I guess the last problematic spot disappeared with the removal
of the pipelined fencing code.

So do the right thing and call check_wedge, which should properly
decide whether an -EAGIN or -EIO is appropriate if wedged is set.

Note that the early check for a wedged gpu before touching the ring is
rather important (and it took me quite some time of acting like the
densest doofus to figure that out): If we don't do that and the gpu
died for good, not having been resurrected by the reset code, userspace
can merrily fill up the entire ring until it notices that something is
amiss.

Allowing userspace to emit more rendering despite knowing that it
will fail can't lead to anything good (and by experience can lead to
all sorts of havoc, including angering the OOM gods and hard-hanging
the hw for good).

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/intel_ringbuffer.c |   18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index cd35ad4..d42d821 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1117,20 +1117,9 @@ static int intel_wrap_ring_buffer(struct intel_ring_buffer *ring)
 
 static int intel_ring_wait_seqno(struct intel_ring_buffer *ring, u32 seqno)
 {
-	struct drm_i915_private *dev_priv = ring->dev->dev_private;
-	bool was_interruptible;
 	int ret;
 
-	/* XXX As we have not yet audited all the paths to check that
-	 * they are ready for ERESTARTSYS from intel_ring_begin, do not
-	 * allow us to be interruptible by a signal.
-	 */
-	was_interruptible = dev_priv->mm.interruptible;
-	dev_priv->mm.interruptible = false;
-
 	ret = i915_wait_seqno(ring, seqno);
-
-	dev_priv->mm.interruptible = was_interruptible;
 	if (!ret)
 		i915_gem_retire_requests_ring(ring);
 
@@ -1240,12 +1229,13 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
 int intel_ring_begin(struct intel_ring_buffer *ring,
 		     int num_dwords)
 {
-	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	drm_i915_private_t *dev_priv = ring->dev->dev_private;
 	int n = 4*num_dwords;
 	int ret;
 
-	if (unlikely(atomic_read(&dev_priv->mm.wedged)))
-		return -EIO;
+	ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible);
+	if (ret)
+		return ret;
 
 	if (unlikely(ring->tail + n > ring->effective_size)) {
 		ret = intel_wrap_ring_buffer(ring);
-- 
1.7.10


* Re: [PATCH 4/5] drm/i915: properly SIGBUS on I/O errors
  2012-07-04 20:18 ` [PATCH 4/5] drm/i915: properly SIGBUS on I/O errors Daniel Vetter
@ 2012-07-04 20:40   ` Daniel Vetter
  0 siblings, 0 replies; 11+ messages in thread
From: Daniel Vetter @ 2012-07-04 20:40 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

On Wed, Jul 04, 2012 at 10:18:42PM +0200, Daniel Vetter wrote:
> ... instead of looping endlessly with no hope of ever serving that
> page-fault. We only need to break out of this loop when the gpu has died,
> to run the reset work (and hopefully resurrect it).

To clarify questions Chris raised on irc: This is about handling I/O
errors that don't come from our own code, but e.g. from the disk dying while
trying to swap in a gem bo. So this patch remedies the issue that the current
handling only covers gpu-death-induced cases of -EIO. Admittedly, dying
disks are much rarer than hanging gpus ...

I'll add that blurb to the commit.

-Daniel
> 
> This seems to have been lost in:
> 
> commit d9bc7e9f32716901c617e1f0fb6ce0f74f172686
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Mon Feb 7 13:09:31 2011 +0000
> 
>     drm/i915: Fix infinite loop regression from 21dd3734
> 
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
>  drivers/gpu/drm/i915/i915_gem.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 7d28555..2b54142 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1141,6 +1141,11 @@ unlock:
>  out:
>  	switch (ret) {
>  	case -EIO:
> +		/* If this -EIO is due to a gpu hang, give the reset code a
> +		 * chance to clean up the mess. Otherwise return the proper
> +		 * SIGBUS. */
> +		if (!atomic_read(&dev_priv->mm.wedged))
> +			return VM_FAULT_SIGBUS;
>  	case -EAGAIN:
>  		/* Give the error handler a chance to run and move the
>  		 * objects off the GPU active list. Next time we service the
> -- 
> 1.7.10
> 

-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

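Restated as a tiny standalone helper (purely illustrative, not the driver's
fault handler), the classification being discussed is:

#include <stdbool.h>

/*
 * An -EIO seen in the fault path can come from two places:
 *   gpu wedged      -> our own hang handling produced it; don't SIGBUS,
 *                      give the reset work a chance and retry the fault.
 *   gpu not wedged  -> it came from another subsystem (e.g. the disk
 *                      dying while swapping the bo back in); SIGBUS is
 *                      the only honest answer.
 */
static bool eio_should_sigbus(bool gpu_wedged)
{
	return !gpu_wedged;
}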

* [PATCH] drm/i915: don't return a spurious -EIO from intel_ring_begin
  2012-07-04 20:18 ` [PATCH 5/5] drm/i915: don't return a spurious -EIO from intel_ring_begin Daniel Vetter
@ 2012-07-04 20:52   ` Daniel Vetter
  0 siblings, 0 replies; 11+ messages in thread
From: Daniel Vetter @ 2012-07-04 20:52 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

The issue with this check is that it results in userspace receiving an
-EIO while the gpu reset hasn't completed, resulting in fallback to sw
rendering or worse.

Now there's also a stern comment in intel_ring_wait_seqno saying that
intel_ring_begin should not return -EAGAIN, ever, because some callers
can't handle that. But after an audit of the callsites I don't see any
issues. I guess the last problematic spot disappeared with the removal
of the pipelined fencing code.

So do the right thing and call check_wedge, which should properly
decide whether an -EAGAIN or -EIO is appropriate if wedged is set.

Note that the early check for a wedged gpu before touching the ring is
rather important (and it took me quite some time of acting like the
densest doofus to figure that out): If we don't do that and the gpu
died for good, not having been resurrected by the reset code, userspace
can merrily fill up the entire ring until it notices that something is
amiss.

Allowing userspace to emit more rendering despite knowing that it
will fail can't lead to anything good (and by experience can lead to
all sorts of havoc, including angering the OOM gods and hard-hanging
the hw for good).

v2: Fix EAGAIN misspelling, noticed by Chris Wilson.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/intel_ringbuffer.c |   18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index cd35ad4..d42d821 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1117,20 +1117,9 @@ static int intel_wrap_ring_buffer(struct intel_ring_buffer *ring)
 
 static int intel_ring_wait_seqno(struct intel_ring_buffer *ring, u32 seqno)
 {
-	struct drm_i915_private *dev_priv = ring->dev->dev_private;
-	bool was_interruptible;
 	int ret;
 
-	/* XXX As we have not yet audited all the paths to check that
-	 * they are ready for ERESTARTSYS from intel_ring_begin, do not
-	 * allow us to be interruptible by a signal.
-	 */
-	was_interruptible = dev_priv->mm.interruptible;
-	dev_priv->mm.interruptible = false;
-
 	ret = i915_wait_seqno(ring, seqno);
-
-	dev_priv->mm.interruptible = was_interruptible;
 	if (!ret)
 		i915_gem_retire_requests_ring(ring);
 
@@ -1240,12 +1229,13 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
 int intel_ring_begin(struct intel_ring_buffer *ring,
 		     int num_dwords)
 {
-	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	drm_i915_private_t *dev_priv = ring->dev->dev_private;
 	int n = 4*num_dwords;
 	int ret;
 
-	if (unlikely(atomic_read(&dev_priv->mm.wedged)))
-		return -EIO;
+	ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible);
+	if (ret)
+		return ret;
 
 	if (unlikely(ring->tail + n > ring->effective_size)) {
 		ret = intel_wrap_ring_buffer(ring);
-- 
1.7.10


* Re: [PATCH 0/5] reset rework, 2nd try
  2012-07-04 20:18 [PATCH 0/5] reset rework, 2nd try Daniel Vetter
                   ` (4 preceding siblings ...)
  2012-07-04 20:18 ` [PATCH 5/5] drm/i915: don't return a spurious -EIO from intel_ring_begin Daniel Vetter
@ 2012-07-04 20:54 ` Chris Wilson
  2012-07-05  8:04   ` Daniel Vetter
  5 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2012-07-04 20:54 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

On Wed,  4 Jul 2012 22:18:38 +0200, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> Hi all,
> 
> It took me a while to see the real issues Chris has been complaining about, but I
> think the reworked patches take them all into account now. The big change is
> that the wedged check at the beginning of intel_ring_begin stays, but gets
> improved by properly deciding between -EIO and -EAGAIN. I've checked all the
> callsites and couldn't find any issues.
> 
> Comments, flames, reviews and testing reports highly welcome.

Other than being dense and requiring a big loud comment explaining that
'-EIO with !wedged' means an -EIO emanating from another subsystem,
this looks good to me and, more importantly, survives a battering from the
pathological renderer.

With the spelling mistakes fixed and an improved comment for the
SIGBUS patch,
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>

The last item on my wishlist for this series would be to capture
more of your discoveries as code comments for next time...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* [PATCH] drm/i915: non-interruptible sleeps can't handle -EAGAIN
  2012-07-04 20:18 ` [PATCH 2/5] drm/i915: non-interruptible sleeps can't handle -EGAIN Daniel Vetter
@ 2012-07-04 20:54   ` Daniel Vetter
  0 siblings, 0 replies; 11+ messages in thread
From: Daniel Vetter @ 2012-07-04 20:54 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

So don't return -EAGAIN, even in the case of a gpu hang. Remap it to
-EIO instead. Note that this isn't really an issue with
interruptibility, but more that we have quite a few codepaths (mostly
around kms stuff) that simply can't handle any errors and hence not
even -EAGAIN. Instead of adding proper failure paths so that we could
restart these ioctls we've opted for the cheap way out of sleeping
non-interruptibly.  Which works everywhere but when the gpu dies,
which this patch fixes.

So essentially interruptible == false means 'wait for the gpu or die
trying'.

This patch is a bit ugly because intel_ring_begin is all non-interruptible
and hence only returns -EIO. But as the comment in there says,
auditing all the callsites would be a pain.

To avoid duplicating code, reuse i915_gem_check_wedge in __wait_seqno
and intel_wait_ring_buffer. Also use the opportunity to clarify the
different cases in i915_gem_check_wedge a bit with comments.

v2: Don't access dev_priv->mm.interruptible from check_wedge - we
might not hold dev->struct_mutex, making this racy. Instead pass
interruptible in as a parameter. I've noticed this because I've hit a
BUG_ON(!mutex_is_locked) at the top of check_wedge. This has been
added in

commit b4aca0106c466b5a0329318203f65bac2d91b682
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   Wed Apr 25 20:50:12 2012 -0700

    drm/i915: extract some common olr+wedge code

although that commit is missing any justification for this. I guess
it's just copy&paste, because the same commit adds the same BUG_ON
check to check_olr, where it indeed makes sense.

But in check_wedge everything we access is protected by other means,
so this is superfluous. And because it now gets in the way (we add a
new caller in __wait_seqno, which can be called without
dev->struct_mutex), let's just remove it.

v3: Group all the i915_gem_check_wedge refactoring into this patch, so
that this patch here is all about not returning -EAGAIN to callsites
that can't handle syscall restarting.

v4: Add clarification of what interruptible == false means in our code,
requested by Ben Widawsky.

v5: Fix EAGAIN misspelling noticed by Chris Wilson.

Reviewed-by: Ben Widawsky <ben@bwidawsk.net>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_drv.h         |    2 ++
 drivers/gpu/drm/i915/i915_gem.c         |   26 ++++++++++++++++++--------
 drivers/gpu/drm/i915/intel_ringbuffer.c |    6 ++++--
 3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d839e4c..6c3a0bb 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1336,6 +1336,8 @@ i915_gem_object_unpin_fence(struct drm_i915_gem_object *obj)
 
 void i915_gem_retire_requests(struct drm_device *dev);
 void i915_gem_retire_requests_ring(struct intel_ring_buffer *ring);
+int __must_check i915_gem_check_wedge(struct drm_i915_private *dev_priv,
+				      bool interruptible);
 
 void i915_gem_reset(struct drm_device *dev);
 void i915_gem_clflush_object(struct drm_i915_gem_object *obj);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6a98c06..af6a510 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1863,11 +1863,10 @@ i915_gem_retire_work_handler(struct work_struct *work)
 	mutex_unlock(&dev->struct_mutex);
 }
 
-static int
-i915_gem_check_wedge(struct drm_i915_private *dev_priv)
+int
+i915_gem_check_wedge(struct drm_i915_private *dev_priv,
+		     bool interruptible)
 {
-	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
-
 	if (atomic_read(&dev_priv->mm.wedged)) {
 		struct completion *x = &dev_priv->error_completion;
 		bool recovery_complete;
@@ -1878,7 +1877,16 @@ i915_gem_check_wedge(struct drm_i915_private *dev_priv)
 		recovery_complete = x->done > 0;
 		spin_unlock_irqrestore(&x->wait.lock, flags);
 
-		return recovery_complete ? -EIO : -EAGAIN;
+		/* Non-interruptible callers can't handle -EAGAIN, hence return
+		 * -EIO unconditionally for these. */
+		if (!interruptible)
+			return -EIO;
+
+		/* Recovery complete, but still wedged means reset failure. */
+		if (recovery_complete)
+			return -EIO;
+
+		return -EAGAIN;
 	}
 
 	return 0;
@@ -1932,6 +1940,7 @@ static int __wait_seqno(struct intel_ring_buffer *ring, u32 seqno,
 	unsigned long timeout_jiffies;
 	long end;
 	bool wait_forever = true;
+	int ret;
 
 	if (i915_seqno_passed(ring->get_seqno(ring), seqno))
 		return 0;
@@ -1963,8 +1972,9 @@ static int __wait_seqno(struct intel_ring_buffer *ring, u32 seqno,
 			end = wait_event_timeout(ring->irq_queue, EXIT_COND,
 						 timeout_jiffies);
 
-		if (atomic_read(&dev_priv->mm.wedged))
-			end = -EAGAIN;
+		ret = i915_gem_check_wedge(dev_priv, interruptible);
+		if (ret)
+			end = ret;
 	} while (end == 0 && wait_forever);
 
 	getrawmonotonic(&now);
@@ -2004,7 +2014,7 @@ i915_wait_seqno(struct intel_ring_buffer *ring, uint32_t seqno)
 
 	BUG_ON(seqno == 0);
 
-	ret = i915_gem_check_wedge(dev_priv);
+	ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index dce4d1a..cd35ad4 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1228,8 +1228,10 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
 		}
 
 		msleep(1);
-		if (atomic_read(&dev_priv->mm.wedged))
-			return -EAGAIN;
+
+		ret = i915_gem_check_wedge(dev_priv, dev_priv->mm.interruptible);
+		if (ret)
+			return ret;
 	} while (!time_after(jiffies, end));
 	trace_i915_ring_wait_end(ring);
 	return -EBUSY;
-- 
1.7.10


* Re: [PATCH 0/5] reset rework, 2nd try
  2012-07-04 20:54 ` [PATCH 0/5] reset rework, 2nd try Chris Wilson
@ 2012-07-05  8:04   ` Daniel Vetter
  0 siblings, 0 replies; 11+ messages in thread
From: Daniel Vetter @ 2012-07-05  8:04 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Daniel Vetter, Intel Graphics Development

On Wed, Jul 04, 2012 at 09:54:09PM +0100, Chris Wilson wrote:
> On Wed,  4 Jul 2012 22:18:38 +0200, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > Hi all,
> > 
> > It took me a while to see the real issues Chris has been complaining about, but I
> > think the reworked patches take them all into account now. The big change is
> > that the wedged check at the beginning of intel_ring_begin stays, but gets
> > improved by properly deciding between -EIO and -EAGAIN. I've checked all the
> > callsites and couldn't find any issues.
> > 
> > Comments, flames, reviews and testing reports highly welcome.
> 
> Other than being dense and requiring a big loud comment explaining that
> '-EIO with !wedged' means an -EIO emanating from another subsystem,
> this looks good to me and, more importantly, survives a battering from the
> pathological renderer.
> 
> With the spelling mistakes fixed and an improved comment for the
> SIGBUS patch,
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> Tested-by: Chris Wilson <chris@chris-wilson.co.uk>

I've queued the entire series for next, thanks a lot for the review and
testing.
-Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48


