* [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
@ 2013-05-24 10:45 Chris Wilson
From: Chris Wilson @ 2013-05-24 10:45 UTC
  To: intel-gfx; +Cc: Daniel Vetter, stable

If reset fails, the GPU is declared wedged. This ideally should never
happen, but very rarely it does. After the GPU is declared wedged, we
must allow userspace to continue to use its mapping of bo in order to
recover its data (and in some cases in order for memory management to
continue unabated). Obviously after the GPU is wedged, no bo are
currently accessed by the GPU and so we can complete any waits or domain
transitions away from the GPU. Currently, we fail this essential task
and instead report EIO and send a SIGBUS to the affected process -
causing major loss of data (by killing X or compiz).

Fixes regression from
commit 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c [v3.9]
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Nov 15 17:17:22 2012 +0100

    drm/i915: clear up wedged transitions

v2: Add comments.

References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Damien Lespiau <damien.lespiau@intel.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/i915_gem.c |   33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 44da25e..ac05845 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -95,9 +95,17 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
 	if (EXIT_COND)
 		return 0;
 
-	/* GPU is already declared terminally dead, give up. */
+	/* GPU is already declared terminally dead, nothing to wait for.
+	 * Return and let the ioctl continue. If we bail out here, then
+	 * we report EIO back to userspace (or worse SIGBUS through a
+	 * pagefault) when the caller is not necessarily interacting with
+	 * the device but is instead performing memory management. If the
+	 * application does instead want (or need) to submit a GPU
+	 * command, then we will report the hung GPU (EIO) when we try
+	 * to acquire space on the ring.
+	 */
 	if (i915_terminally_wedged(error))
-		return -EIO;
+		return 0;
 
 	/*
 	 * Only wait 10 seconds for the gpu reset to complete to avoid hanging
@@ -109,13 +117,17 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
 					       10*HZ);
 	if (ret == 0) {
 		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
-		return -EIO;
-	} else if (ret < 0) {
-		return ret;
-	}
+		/* The impossible happened, mark the device as terminally
+		 * wedged so that we fail quicker next time. If the reset
+		 * does eventually complete, the terminally wedged status
+		 * will be confirmed, or the counter reset.
+		 */
+		atomic_set(&error->reset_counter, I915_WEDGED);
+	} else if (ret > 0)
+		ret = 0;
 #undef EXIT_COND
 
-	return 0;
+	return ret;
 }
 
 int i915_mutex_lock_interruptible(struct drm_device *dev)
@@ -1211,10 +1223,13 @@ i915_gem_set_domain_ioctl(struct drm_device *dev, void *data,
 
 	/* Try to flush the object off the GPU without holding the lock.
 	 * We will repeat the flush holding the lock in the normal manner
-	 * to catch cases where we are gazumped.
+	 * to catch cases where we are gazumped. Also because it is unlocked,
+	 * it is possible for a spurious GPU hang to occur whilst we wait.
+	 * In that event, just continue on and see if it is confirmed by the
+	 * locked wait.
 	 */
 	ret = i915_gem_object_wait_rendering__nonblocking(obj, !write_domain);
-	if (ret)
+	if (ret && ret != -EIO)
 		goto unref;
 
 	if (read_domains & I915_GEM_DOMAIN_GTT) {
-- 
1.7.10.4


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-24  9:03           ` Daniel Vetter
@ 2013-05-24  9:36             ` Chris Wilson
From: Chris Wilson @ 2013-05-24  9:36 UTC
  To: Daniel Vetter; +Cc: intel-gfx

On Fri, May 24, 2013 at 11:03:06AM +0200, Daniel Vetter wrote:
> On Wed, May 1, 2013 at 6:23 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> > The atomic_set(WEDGED) is imo very dangerous, since it'll wreak havoc
> > with our accounting. But in general I really prefer an angry user with
> > a stuck-process backtrace to trying to paper over our own
> > shortcomings with hacks. And at least for the gpu hang vs.
> > pageflip/set_base we should now be fairly well-covered with tests. And
> > I haven't yet seen a report indicating that there's another issue
> > looming ...
> >
> > I guess an obnoxious WARN with a return 0; would also work for the
> > time out case. But I prefer to just ditch the timeout.
> 
> Poke ... I still think setting wedged here is a bit risky.

The alternative is that every ioctl then takes 10s. But if we do set
wedged and it recovers, it is reset to 0. If the reset fails, it is also
set to wedged. It's a fugly thing to do, but at that level of paranoia,
I think it is the right thing to do.
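
A rough model of those semantics, for illustration only -- the flag value
and the helper names below are assumptions, not the driver's actual
definitions:

#include <stdatomic.h>
#include <stdbool.h>

/* Toy model: bit 31 of the counter marks the device terminally wedged. */
#define WEDGED (1u << 31)

static atomic_uint reset_counter;

static bool terminally_wedged(void)
{
	return atomic_load(&reset_counter) & WEDGED;
}

/* wait_for_error timing out: assume the worst so later ioctls fail fast. */
static void mark_wedged_on_timeout(void)
{
	atomic_store(&reset_counter, WEDGED);
}

/* The reset handler then either clears the wedge (recovery) or confirms it. */
static void reset_complete(bool success)
{
	atomic_store(&reset_counter, success ? 0 : WEDGED);
}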

> And maybe
> add a comment why wait_for_error eats the -EIO (and why
> intel_ring_begin does not eat the -EIO)?

Sure. The patch today is slightly different anyway...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-01 16:23         ` Daniel Vetter
@ 2013-05-24  9:03           ` Daniel Vetter
  2013-05-24  9:36             ` Chris Wilson
From: Daniel Vetter @ 2013-05-24  9:03 UTC
  To: Chris Wilson, Daniel Vetter, intel-gfx

On Wed, May 1, 2013 at 6:23 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, May 1, 2013 at 3:01 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
>>> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>>> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>>> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>>> >> > If reset fails, the GPU is declared wedged. This ideally should never
>>> >> > happen, but very rarely it does. After the GPU is declared wedged, we
>>> >> > must allow userspace to continue to use its mapping of bo in order to
>>> >> > recover its data (and in some cases in order for memory management to
>>> >> > continue unabated). Obviously after the GPU is wedged, no bo are
>>> >> > currently accessed by the GPU and so we can complete any waits or domain
>>> >> > transitions away from the GPU. Currently, we fail this essential task
>>> >> > and instead report EIO and send a SIGBUS to the affected process -
>>> >> > causing major loss of data (by killing X or compiz).
>>> >> >
>>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>>> >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> >>
>>> >> So I've read again through the reset code and I still don't see how
>>> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>>> >> -EIO eating after wait_rendering looks really suspicious to me.
>>> >
>>> > That's always been a layer of defense against the driver. Alternatively,
>>> > we can throw a warn into the wait as we shouldn't enter there with
>>> > a seqno and a wedged GPU.
>>>
>>> Yeah, that sounds like a good idea. We really shouldn't ever have an
>>> outstanding request while the gpu is wedged (ignoring dangerous changes
>>> to the wedged state through debugfs).
>>
>> Except it is only true for a locked wait. :(
>
> Hm, indeed ... we need to convert any -EIO into a -ERESTARTSYS in (at
> least nonblocking) waits. We'd get that by simply dropping all the
> check_wedge calls from the wait functions. That would leave us with a
> check_wedge in throttle and intel_ring_begin, which I think are the
> two places we really want that.
>
>>> >> Now the other thing is i915_gem_object_wait_rendering, that thing loves
>>> >> to throw an -EIO at us. And on a quick check your patch misses the one
>>> >> in set_domain_ioctl. We probably need to do the same with
>>> >> sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
>>> >> or similar to explicitly annotate the few places we don't want to hear
>>> >> about a dead gpu?
>>> >
>>> > In fact, it turns out that we hardly ever want
>>> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
>>> > that EIO was only ever returned when we tried to issue a command on the
>>> > GPU whilst it was terminally wedged - in order to preserve data
>>> > integrity.
>>>
>>> Don't we need to re-add a check in execbuf then? Otherwise I guess
>>> userspace has no idea ever that something is amiss ...
>>
>> It will catch the EIO as soon as it tries to emit the command, but
>> adding the explicit check will save a ton of work in reserving buffers.
>
> Right, I've forgotten about the check in intel_ring_begin.
>
>>> Otherwise I think this approach would also make sense, and feels more
>>> future-proof. Applications that want real robustness simply need to
>>> query the kernel through the new interface whether anything ugly has
>>> happened to them. And if we want more synchronous feedback we can
>>> always improve wait/set_domain_ioctl to check whether the given bo
>>> suffered through a little calamity.
>>
>> The only explicit point we have to remember is to preserve the
>> throttle() semantics of reporting EIO when wedged.
>
> I think plugging the leak in execbuf is still good to avoid queueing
> up tons of crap (since we really don't want to do that). But I think
> with the -EIO check in intel_ring_begin we are covered.
>
>>> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> > index 2bd8d7a..f243e32 100644
>>> > --- a/drivers/gpu/drm/i915/i915_gem.c
>>> > +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>>> >
>>> >         /* GPU is already declared terminally dead, give up. */
>>> >         if (i915_terminally_wedged(error))
>>> > -               return -EIO;
>>> > +               return 0;
>>> >
>>> >         /*
>>> >          * Only wait 10 seconds for the gpu reset to complete to avoid hanging
>>> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>>> >                                                10*HZ);
>>> >         if (ret == 0) {
>>> >                 DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
>>> > -               return -EIO;
>>> > -       } else if (ret < 0) {
>>> > -               return ret;
>>> > -       }
>>> > +               atomic_set(&error->reset_counter, I915_WEDGED);
>>> > +       } else if (ret > 0)
>>> > +               ret = 0;
>>>
>>> Less convinced about this hunk here. I think the right approach would
>>> be to simply kill the timeout. Unlucky users can always get out of
>>> here with a signal, and it's imo more robust if we have all relevant
>>> timeouts for the reset process in the reset work. Separate patch
>>> though. Also we now have quite good coverage of all the reset handling
>>> in igt, so I don't think we need to go overboard in defending against
>>> our own logic screw-ups ... Reset really should work nowadays.
>>
>> Except that #63291 said otherwise (iirc)... I think several layers of
>> paranoia over handling GPU and driver hangs are justified.
>
> The atomic_set(WEDGED) is imo very dangerous, since it'll wreak havoc
> with our accounting. But in general I really prefer an angry user with
> a stuck-process backtrace to trying to paper over our own
> shortcomings with hacks. And at least for the gpu hang vs.
> pageflip/set_base we should now be fairly well-covered with tests. And
> I haven't yet seen a report indicating that there's another issue
> looming ...
>
> I guess an obnoxious WARN with a return 0; would also work for the
> time out case. But I prefer to just ditch the timeout.

Poke ... I still think setting wedged here is a bit risky. And maybe
add a comment why wait_for_error eats the -EIO (and why
intel_ring_begin does not eat the -EIO)?
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-01 13:01       ` Chris Wilson
@ 2013-05-01 16:23         ` Daniel Vetter
  2013-05-24  9:03           ` Daniel Vetter
From: Daniel Vetter @ 2013-05-01 16:23 UTC
  To: Chris Wilson, Daniel Vetter, intel-gfx

On Wed, May 1, 2013 at 3:01 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
>> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> >> > If reset fails, the GPU is declared wedged. This ideally should never
>> >> > happen, but very rarely it does. After the GPU is declared wedged, we
>> >> > must allow userspace to continue to use its mapping of bo in order to
>> >> > recover its data (and in some cases in order for memory management to
>> >> > continue unabated). Obviously after the GPU is wedged, no bo are
>> >> > currently accessed by the GPU and so we can complete any waits or domain
>> >> > transitions away from the GPU. Currently, we fail this essential task
>> >> > and instead report EIO and send a SIGBUS to the affected process -
>> >> > causing major loss of data (by killing X or compiz).
>> >> >
>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>> >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> >>
>> >> So I've read again through the reset code and I still don't see how
>> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>> >> -EIO eating after wait_rendering looks really suspicious to me.
>> >
>> > That's always been a layer of defense against the driver. Alternatively,
>> > we can throw a warn into the wait as we shouldn't enter there with
>> > a seqno and a wedged GPU.
>>
>> Yeah, that sounds like a good idea. We really shouldn't ever have an
>> outstanding request while the gpu is wedged (ignoring dangerous changes
>> to the wedged state through debugfs).
>
> Except it is only true for a locked wait. :(

Hm, indeed ... we need to convert any -EIO into a -ERESTARTSYS in (at
least nonblocking) waits. We'd get that by simply dropping all the
check_wedge calls from the wait functions. That would leave us with a
check_wedge in throttle and intel_ring_begin, which I think are the
two places we really want that.
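
As a minimal sketch of that effect, done at the call site rather than by
dropping the check_wedge calls inside the waits themselves (the wrapper
name is an assumption, not code from the patch):

static int
wait_rendering_restartable(struct drm_i915_gem_object *obj, bool readonly)
{
	int ret = i915_gem_object_wait_rendering__nonblocking(obj, readonly);

	/* A dead GPU is not this ioctl's problem: ask userspace to restart
	 * the call and leave the -EIO reporting to intel_ring_begin. */
	if (ret == -EIO)
		ret = -ERESTARTSYS;
	return ret;
}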

>> >> Now the other thing is i915_gem_object_wait_rendering, that thing loves
>> >> to throw an -EIO at us. And on a quick check your patch misses the one
>> >> in set_domain_ioctl. We probably need to do the same with
>> >> sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
>> >> or similar to explicitly annotate the few places we don't want to hear
>> >> about a dead gpu?
>> >
>> > In fact, it turns out that we hardly ever want
>> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
>> > that EIO was only ever returned when we tried to issue a command on the
>> > GPU whilst it was terminally wedged - in order to preserve data
>> > integrity.
>>
>> Don't we need to re-add a check in execbuf then? Otherwise I guess
>> userspace has no idea ever that something is amiss ...
>
> It will catch the EIO as soon as it tries to emit the command, but
> adding the explicit check will save a ton of work in reserving buffers.

Right, I've forgotten about the check in intel_ring_begin.

>> Otherwise I think this approach would also make sense, and feels more
>> future-proof. Applications that want real robustness simply need to
>> query the kernel through the new interface whether anything ugly has
>> happened to them. And if we want more synchronous feedback we can
>> always improve wait/set_domain_ioctl to check whether the given bo
>> suffered through a little calamity.
>
> The only explicit point we have to remember is to preserve the
> throttle() semantics of reporting EIO when wedged.

I think plugging the leak in execbuf is still good to avoid queueing
up tons of crap (since we really don't want to do that). But I think
with the -EIO check in intel_ring_begin we are covered.
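
For illustration, that execbuf plug could be as small as a helper like the
one below, called before the buffer reservation loop (the helper name and
the placement are assumptions):

static int eb_check_wedged(struct drm_device *dev)
{
	struct drm_i915_private *dev_priv = dev->dev_private;

	/* Don't bother reserving buffers for a command we can never run;
	 * intel_ring_begin stays the authoritative source of -EIO. */
	if (i915_terminally_wedged(&dev_priv->gpu_error))
		return -EIO;

	return 0;
}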

>> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> > index 2bd8d7a..f243e32 100644
>> > --- a/drivers/gpu/drm/i915/i915_gem.c
>> > +++ b/drivers/gpu/drm/i915/i915_gem.c
>> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>> >
>> >         /* GPU is already declared terminally dead, give up. */
>> >         if (i915_terminally_wedged(error))
>> > -               return -EIO;
>> > +               return 0;
>> >
>> >         /*
>> >          * Only wait 10 seconds for the gpu reset to complete to avoid hanging
>> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>> >                                                10*HZ);
>> >         if (ret == 0) {
>> >                 DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
>> > -               return -EIO;
>> > -       } else if (ret < 0) {
>> > -               return ret;
>> > -       }
>> > +               atomic_set(&error->reset_counter, I915_WEDGED);
>> > +       } else if (ret > 0)
>> > +               ret = 0;
>>
>> Less convinced about this hunk here. I think the right approach would
>> be to simply kill the timeout. Unlucky users can always get out of
>> here with a signal, and it's imo more robust if we have all relevant
>> timeouts for the reset process in the reset work. Separate patch
>> though. Also we now have quite good coverage of all the reset handling
>> in igt, so I don't think we need to go overboard in defending against
>> our own logic screw-ups ... Reset really should work nowadays.
>
> Except that #63291 said otherwise (iirc)... I think several layers of
> paranoia over handling GPU and driver hangs are justified.

The atomic_set(WEDGED) is imo very dangerous, since it'll wreak havoc
with our accounting. But in general I really prefer an angry user with
a stuck-process backtrace to trying to paper over our own
shortcomings with hacks. And at least for the gpu hang vs.
pageflip/set_base we should now be fairly well-covered with tests. And
I haven't yet seen a report indicating that there's another issue
looming ...

I guess an obnoxious WARN with a return 0; would also work for the
time out case. But I prefer to just ditch the timeout.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-01 12:40     ` Daniel Vetter
@ 2013-05-01 13:01       ` Chris Wilson
  2013-05-01 16:23         ` Daniel Vetter
From: Chris Wilson @ 2013-05-01 13:01 UTC
  To: Daniel Vetter; +Cc: intel-gfx

On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >> > If reset fails, the GPU is declared wedged. This ideally should never
> >> > happen, but very rarely it does. After the GPU is declared wedged, we
> >> > must allow userspace to continue to use its mapping of bo in order to
> >> > recover its data (and in some cases in order for memory management to
> >> > continue unabated). Obviously after the GPU is wedged, no bo are
> >> > currently accessed by the GPU and so we can complete any waits or domain
> >> > transitions away from the GPU. Currently, we fail this essential task
> >> > and instead report EIO and send a SIGBUS to the affected process -
> >> > causing major loss of data (by killing X or compiz).
> >> >
> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>
> >> So I've read again through the reset code and I still don't see how
> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
> >> -EIO eating after wait_rendering looks really suspicious to me.
> >
> > That's always been a layer of defense against the driver. Alternatively,
> > we can throw a warn into the wait as we shouldn't enter there with
> > a seqno and a wedged GPU.
> 
> Yeah, that sounds like a good idea. We really shouldn't ever have an
> outstanding request while the gpu is wedged (ignoring dangerous changes
> to the wedged state through debugfs).

Except it is only true for a locked wait. :(

> >> Now the other thing is i915_gem_object_wait_rendering, that thing loves
> >> to throw an -EIO at us. And on a quick check your patch misses the one
> >> in set_domain_ioctl. We probably need to do the same with
> >> sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
> >> or similar to explicitly annotate the few places we don't want to hear
> >> about a dead gpu?
> >
> > In fact, it turns out that we hardly ever want
> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
> > that EIO was only ever returned when we tried to issue a command on the
> > GPU whilst it was terminally wedged - in order to preserve data
> > integrity.
> 
> Don't we need to re-add a check in execbuf then? Otherwise I guess
> userspace has no idea ever that something is amiss ...

It will catch the EIO as soon as it tries to emit the command, but
adding the explicit check will save a ton of work in reserving buffers.
 
> Otherwise I think this approach would also make sense, and feels more
> future-proof. Applications that want real robustness simply need to
> query the kernel through the new interface whether anything ugly has
> happened to them. And if we want more synchronous feedback we can
> always improve wait/set_domain_ioctl to check whether the given bo
> suffered through a little calamity.

The only explicit point we have to remember is to preserve the
throttle() semantics of reporting EIO when wedged.
 
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 2bd8d7a..f243e32 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
> >
> >         /* GPU is already declared terminally dead, give up. */
> >         if (i915_terminally_wedged(error))
> > -               return -EIO;
> > +               return 0;
> >
> >         /*
> >          * Only wait 10 seconds for the gpu reset to complete to avoid hanging
> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
> >                                                10*HZ);
> >         if (ret == 0) {
> >                 DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
> > -               return -EIO;
> > -       } else if (ret < 0) {
> > -               return ret;
> > -       }
> > +               atomic_set(&error->reset_counter, I915_WEDGED);
> > +       } else if (ret > 0)
> > +               ret = 0;
> 
> Less convinced about this hunk here. I think the right approach would
> be to simply kill the timeout. Unlucky users can always get out of
> here with a signal, and it's imo more robust if we have all relevant
> timeouts for the reset process in the reset work. Separate patch
> though. Also we now have quite good coverage of all the reset handling
> in igt, so I don't think we need to go overboard in defending against
> our own logic screw-ups ... Reset really should work nowadays.

Except that #63291 said otherwise (iirc)... I think several layers of
paranoia over handling GPU and driver hangs are justified.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-01 11:06   ` Chris Wilson
@ 2013-05-01 12:40     ` Daniel Vetter
  2013-05-01 13:01       ` Chris Wilson
From: Daniel Vetter @ 2013-05-01 12:40 UTC
  To: Chris Wilson, Daniel Vetter, intel-gfx

On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> > If reset fails, the GPU is declared wedged. This ideally should never
>> > happen, but very rarely it does. After the GPU is declared wedged, we
>> > must allow userspace to continue to use its mapping of bo in order to
>> > recover its data (and in some cases in order for memory management to
>> > continue unabated). Obviously after the GPU is wedged, no bo are
>> > currently accessed by the GPU and so we can complete any waits or domain
>> > transitions away from the GPU. Currently, we fail this essential task
>> > and instead report EIO and send a SIGBUS to the affected process -
>> > causing major loss of data (by killing X or compiz).
>> >
>> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>
>> So I've read again through the reset code and I still don't see how
>> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>> -EIO eating after wait_rendering looks really suspicious to me.
>
> That's always been a layer of defense against the driver. Alternatively,
> we can throw a warn into the wait as we shouldn't enter there with
> a seqno and a wedged GPU.

Yeah, that sounds like a good idea. We really shouldn't ever have an
outstanding request while the gpu is wedged (ignoring dangerous changes
to the wedged state through debugfs).

>> Now the other thing is i915_gem_object_wait_rendering, that thing loves
>> to throw an -EIO at us. And on a quick check your patch misses the one
>> in set_domain_ioctl. We probably need to do the same with
>> sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
>> or similar to explicitly annotate the few places we don't want to hear
>> about a dead gpu?
>
> In fact, it turns out that we hardly ever want
> i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
> that EIO was only ever returned when we tried to issue a command on the
> GPU whilst it was terminally wedged - in order to preserve data
> integrity.

Don't we need to re-add a check in execbuf then? Otherwise I guess
userspace has no idea ever that something is amiss ...

Otherwise I think this approach would also make sense, and feels more
future-proof. Applications that want real robustness simply need to
query the kernel through the new interface whether anything ugly has
happened to them. And if we want more synchronous feedback we can
always improve wait/set_domain_ioctl to check whether the given bo
suffered through a little calamity.

> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 2bd8d7a..f243e32 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>
>         /* GPU is already declared terminally dead, give up. */
>         if (i915_terminally_wedged(error))
> -               return -EIO;
> +               return 0;
>
>         /*
>          * Only wait 10 seconds for the gpu reset to complete to avoid hanging
> @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>                                                10*HZ);
>         if (ret == 0) {
>                 DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
> -               return -EIO;
> -       } else if (ret < 0) {
> -               return ret;
> -       }
> +               atomic_set(&error->reset_counter, I915_WEDGED);
> +       } else if (ret > 0)
> +               ret = 0;

Less convinced about this hunk here. I think the right approach would
be to simply kill the timeout. Unlucky users can always get out of
here with a signal, and it's imo more robust if we have all relevant
timeouts for the reset process in the reset work. Separate patch
though. Also we now have quite good coverage of all the reset handling
in igt, so I don't think we need to go overboard in defending against
our own logic screw-ups ... Reset really should work nowadays.

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-01 10:38 ` Daniel Vetter
@ 2013-05-01 11:06   ` Chris Wilson
  2013-05-01 12:40     ` Daniel Vetter
From: Chris Wilson @ 2013-05-01 11:06 UTC
  To: Daniel Vetter; +Cc: intel-gfx

On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > If reset fails, the GPU is declared wedged. This ideally should never
> > happen, but very rarely it does. After the GPU is declared wedged, we
> > must allow userspace to continue to use its mapping of bo in order to
> > recover its data (and in some cases in order for memory management to
> > continue unabated). Obviously after the GPU is wedged, no bo are
> > currently accessed by the GPU and so we can complete any waits or domain
> > transitions away from the GPU. Currently, we fail this essential task
> > and instead report EIO and send a SIGBUS to the affected process -
> > causing major loss of data (by killing X or compiz).
> >
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> So I've read again through the reset code and I still don't see how
> wait_rendering can ever give us -EIO once the gpu is dead. So all the
> -EIO eating after wait_rendering looks really suspicious to me.

That's always been a layer of defense against the driver. Alternatively,
we can throw a warn into the wait as we shouldn't enter there with
a seqno and a wedged GPU.
 
> Now the other thing is i915_gem_object_wait_rendering, that thing loves
> to throw an -EIO at us. And on a quick check your patch misses the one
> in set_domain_ioctl. We probably need to do the same with
> sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
> or similar to explicitly annotate the few places we don't want to hear
> about a dead gpu?

In fact, it turns out that we hardly ever want
i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
that EIO was only ever returned when we tried to issue a command on the
GPU whilst it was terminally wedged - in order to preserve data
integrity.

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 2bd8d7a..f243e32 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
 
        /* GPU is already declared terminally dead, give up. */
        if (i915_terminally_wedged(error))
-               return -EIO;
+               return 0;
 
        /*
         * Only wait 10 seconds for the gpu reset to complete to avoid hanging
@@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
                                               10*HZ);
        if (ret == 0) {
                DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
-               return -EIO;
-       } else if (ret < 0) {
-               return ret;
-       }
+               atomic_set(&error->reset_counter, I915_WEDGED);
+       } else if (ret > 0)
+               ret = 0;
 #undef EXIT_COND
 
-       return 0;
+       return ret;
 }
 
 int i915_mutex_lock_interruptible(struct drm_device *dev)

 
> And if the chances of us breaking bo waiting are too high we can
> always add a few crazy igts which manually wedge the gpu to test them
> and ensure they all work.

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-01 10:25 Chris Wilson
@ 2013-05-01 10:38 ` Daniel Vetter
  2013-05-01 11:06   ` Chris Wilson
From: Daniel Vetter @ 2013-05-01 10:38 UTC
  To: Chris Wilson; +Cc: intel-gfx

On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> If reset fails, the GPU is declared wedged. This ideally should never
> happen, but very rarely it does. After the GPU is declared wedged, we
> must allow userspace to continue to use its mapping of bo in order to
> recover its data (and in some cases in order for memory management to
> continue unabated). Obviously after the GPU is wedged, no bo are
> currently accessed by the GPU and so we can complete any waits or domain
> transitions away from the GPU. Currently, we fail this essential task
> and instead report EIO and send a SIGBUS to the affected process -
> causing major loss of data (by killing X or compiz).
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

So I've read again through the reset code and I still don't see how
wait_rendering can ever give us -EIO once the gpu is dead. So all the
-EIO eating after wait_rendering looks really suspicious to me.

Now the other thing is i915_gem_object_wait_rendering, that thing loves
to throw an -EIO at us. And on a quick check your patch misses the one
in set_domain_ioctl. We probably need to do the same with
sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
or similar to explicitly annotate the few places we don't want to hear
about a dead gpu?
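
For concreteness, one possible shape of that helper -- the name is just the
placeholder above, and the body is an assumption about how it might differ
from i915_mutex_lock_interruptible:

int i915_mutex_lock_interruptible_no_eio(struct drm_device *dev)
{
	struct drm_i915_private *dev_priv = dev->dev_private;
	int ret;

	/* Still wait for a reset in progress, but swallow the terminal
	 * -EIO so memory management paths never see a dead GPU here. */
	ret = i915_gem_wait_for_error(&dev_priv->gpu_error);
	if (ret && ret != -EIO)
		return ret;

	return mutex_lock_interruptible(&dev->struct_mutex);
}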

And if the chances of us breaking bo waiting are too high we can
always add a few crazy igts which manually wedge the gpu to test them
and ensure they all work.
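
Such a test could force the wedge with something as small as the sketch
below, assuming the i915_wedged debugfs file at its usual path (the path
and the written value are assumptions, not igt library code):

#include <stdio.h>

static int wedge_gpu(void)
{
	/* Writing a non-zero value asks the driver to treat the GPU as hung. */
	FILE *f = fopen("/sys/kernel/debug/dri/0/i915_wedged", "w");

	if (!f)
		return -1;
	fprintf(f, "1\n");
	fclose(f);
	return 0;
}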

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
@ 2013-05-01 10:25 Chris Wilson
  2013-05-01 10:38 ` Daniel Vetter
From: Chris Wilson @ 2013-05-01 10:25 UTC
  To: intel-gfx

If reset fails, the GPU is declared wedged. This ideally should never
happen, but very rarely it does. After the GPU is declared wedged, we
must allow userspace to continue to use its mapping of bo in order to
recover its data (and in some cases in order for memory management to
continue unabated). Obviously after the GPU is wedged, no bo are
currently accessed by the GPU and so we can complete any waits or domain
transitions away from the GPU. Currently, we fail this essential task
and instead report EIO and send a SIGBUS to the affected process -
causing major loss of data (by killing X or compiz).

References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index e0c3ada..2bd8d7a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1214,7 +1214,7 @@ i915_gem_set_domain_ioctl(struct drm_device *dev, void *data,
 	 * to catch cases where we are gazumped.
 	 */
 	ret = i915_gem_object_wait_rendering__nonblocking(obj, !write_domain);
-	if (ret)
+	if (ret && ret != -EIO)
 		goto unref;
 
 	if (read_domains & I915_GEM_DOMAIN_GTT) {
@@ -1334,6 +1334,8 @@ int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	bool write = !!(vmf->flags & FAULT_FLAG_WRITE);
 
 	ret = i915_mutex_lock_interruptible(dev);
+	if (ret == -EIO)
+		ret = mutex_lock_interruptible(&dev->struct_mutex);
 	if (ret)
 		goto out;
 
@@ -2732,7 +2734,7 @@ i915_gem_object_wait_fence(struct drm_i915_gem_object *obj)
 {
 	if (obj->last_fenced_seqno) {
 		int ret = i915_wait_seqno(obj->ring, obj->last_fenced_seqno);
-		if (ret)
+		if (ret && ret != -EIO)
 			return ret;
 
 		obj->last_fenced_seqno = 0;
@@ -3141,7 +3143,7 @@ i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
 		return 0;
 
 	ret = i915_gem_object_wait_rendering(obj, !write);
-	if (ret)
+	if (ret && ret != -EIO)
 		return ret;
 
 	i915_gem_object_flush_cpu_write_domain(obj);
@@ -3382,7 +3384,7 @@ i915_gem_object_finish_gpu(struct drm_i915_gem_object *obj)
 		return 0;
 
 	ret = i915_gem_object_wait_rendering(obj, false);
-	if (ret)
+	if (ret && ret != -EIO)
 		return ret;
 
 	/* Ensure that we invalidate the GPU's caches and TLBs. */
@@ -3406,7 +3408,7 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
 		return 0;
 
 	ret = i915_gem_object_wait_rendering(obj, !write);
-	if (ret)
+	if (ret && ret != -EIO)
 		return ret;
 
 	i915_gem_object_flush_gtt_write_domain(obj);
-- 
1.7.10.4

